Message ID: 20110413083656.12e54a91@tlielax.poochiereds.net (mailing list archive)
State: New, archived
On 4/13/2011 5:36 AM, Jeff Layton wrote:
> On Tue, 12 Apr 2011 10:46:00 -0700
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
>> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
>>
>>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
>>>
>>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>>>>
>>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We recently ran into a serious performance issue with the NFS
>>>>>> client. It turned out to be due to the lack of readv/writev
>>>>>> support in the NFS (O_DIRECT) client.
>>>>>>
>>>>>> Here is our use-case:
>>>>>>
>>>>>> In our cloud environment, our storage is over NFS. Files on NFS
>>>>>> are passed as block devices to the guest (using O_DIRECT). When
>>>>>> the guest does IO on these block devices, it ends up as O_DIRECT
>>>>>> writes to NFS (on the KVM host).
>>>>>>
>>>>>> QEMU (on the host) gets a vector from the virtio-ring and submits
>>>>>> it. Old versions of QEMU linearized the vector they got from KVM
>>>>>> (copied it into a buffer) and submitted the buffer, so the NFS
>>>>>> client always received a single buffer.
>>>>>>
>>>>>> Later versions of QEMU eliminated this copy and submit the vector
>>>>>> directly using preadv/pwritev().
>>>>>>
>>>>>> The NFS client loops through the vector and submits each segment
>>>>>> as a separate request for each IO < wsize. In our case (negotiated
>>>>>> wsize=1MB), for a 256K IO we get 64 segments, each 4K. So we end
>>>>>> up submitting 64 4K FILE_SYNC IOs, and the server ends up doing
>>>>>> each 4K synchronously. This causes serious performance
>>>>>> degradation. We are trying to see if the performance improves if
>>>>>> we convert the IOs to ASYNC, but our initial results don't look
>>>>>> good.
>>>>>>
>>>>>> readv/writev support in the NFS client for all possible cases is
>>>>>> hard. Instead, if all vectors are page-aligned and the IO sizes
>>>>>> are page-multiples, it fits the current code easily. Luckily, the
>>>>>> QEMU use-case fits these requirements.
>>>>>>
>>>>>> Here is the patch to add this support. Comments?
>>>>>>
>>>>> Restricting buffer alignment requirements would be an onerous API
>>>>> change, IMO.
>>>>>
>>>> I am not suggesting an API change at all. All I am doing is: if all
>>>> the IOs are aligned, we can take a fast path and do them in a single
>>>> IO request (as if we got a single buffer). Otherwise, we do it the
>>>> hard way as today - loop through each one and do them individually.
>>>>
>>> Thanks for the clarification. That means you don't also address the
>>> problem of doing multiple small segments with FILE_SYNC writes.
>>>
>>>>> If the NFS write path is smart enough not to set FILE_SYNC when
>>>>> there are multiple segments to write, then the problem should be
>>>>> mostly fixed. I think Jeff Layton already has a patch that does
>>>>> this.
>>>>>
>>>> We are trying that patch. It improves performance a little, but not
>>>> anywhere close to doing it as a single vector/buffer.
>>>>
>>>> Khoa, can you share your performance data for all the
>>>> suggestions/patches you tried so far?
>>>>
>>> The individual WRITEs should be submitted in parallel. If there is
>>> additional performance overhead, it is probably due to the small RPC
>>> slot table size. Have you tried increasing it?
>>>
>> We haven't tried both fixes together (RPC slot increase, turn into
>> ASYNC). Each one individually didn't help much. We will try them
>> together.
>>
> I must have missed some of these emails, but here's the patch that
> Chuck mentioned, based on his suggestion. It may be reasonable as a
> stopgap solution until Trond's overhaul of the DIO code is complete.
>
> With this + a larger slot table size, I would think you'd have a
> substantial write performance improvement. Probably not as good as if
> the writes were coalesced, but it should help.
>
> Badari, if you can let us know whether this plus increasing the slot
> table size helps, then I'll plan to "officially" submit it. This one is
> against RHEL6 but it should apply to mainline kernels with a small
> offset.
>
> -----------------[snip]-----------------
>
> BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
>
> Currently, the client uses FILE_SYNC whenever it's writing less than
> or equal data to the wsize. This is a problem though if we have a
> bunch of small iovecs batched up in a single writev call. The client
> will iterate over them and do a single FILE_SYNC WRITE for each.
>
> Instead, change the code to do unstable writes when we'll need to do
> multiple WRITE RPCs in order to satisfy the request. While we're at
> it, optimize away the allocation of commit_data when we aren't going
> to use it anyway.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

Khoa is running tests with this + RPC table change. Single-thread
performance improved, but multi-thread tests don't scale (I guess we
are running out of RPC table entries even with 128?). He will share the
results shortly.

Thanks,
Badari

> ---
>  fs/nfs/direct.c |   13 +++++++++++--
>  1 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 398f8ed..d2ed659 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
>  	dreq = nfs_direct_req_alloc();
>  	if (!dreq)
>  		goto out;
> -	nfs_alloc_commit_data(dreq);
>
> -	if (dreq->commit_data == NULL || count <= wsize)
> +	if (count > wsize || nr_segs > 1)
> +		nfs_alloc_commit_data(dreq);
> +	else
> +		dreq->commit_data = NULL;
> +
> +	/*
> +	 * If we couldn't allocate commit data, or we'll just be doing a
> +	 * single write, then make this a NFS_FILE_SYNC write and do away
> +	 * with the commit.
> +	 */
> +	if (dreq->commit_data == NULL)
>  		sync = NFS_FILE_SYNC;
>
>  	dreq->inode = inode;

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 13 Apr 2011 06:43:53 -0700
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> [...]
>
> Khoa is running tests with this + RPC table change. Single-thread
> performance improved, but multi-thread tests don't scale (I guess we
> are running out of RPC table entries even with 128?). He will share
> the results shortly.

Ok, good to know. I tend to think that the whole slot table concept
needs some reconsideration anyway. Off the top of my head...

We could put the rpc_rqsts into a slabcache, and give each rpc_xprt a
mempool with a minimum number of slots. Have them all be allocated with
GFP_NOWAIT. If a task gets a NULL pointer back, it can sleep on the
waitqueue like it does today. Then the clients can allocate rpc_rqsts
as they need them, for as long as memory holds out.

We have the reserve_xprt stuff to handle congestion control anyway, so
I don't really see the value in the artificial limits that the slot
table provides.

Maybe I should hack up a patchset for this...
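Jeff's slabcache/mempool idea can be sketched as a userspace model. This is only an illustration of the allocation policy he describes (a guaranteed per-transport minimum plus best-effort growth); all names are hypothetical, and the `dynamic_limit` knob stands in for a `kmem_cache_alloc(..., GFP_NOWAIT)` failure under memory pressure:

```c
#include <assert.h>
#include <stdlib.h>

#define MIN_SLOTS 16	/* mempool reserve: always available */

struct rpc_rqst { int in_use; };

struct rpc_xprt {
	struct rpc_rqst reserve[MIN_SLOTS];	/* models the mempool minimum */
	int reserve_used;
	int dynamic_alloced;	/* slots taken from the "slabcache" */
	int dynamic_limit;	/* 0 = unlimited; models memory pressure */
};

/* Try to get a slot without blocking; NULL means the caller must sleep
 * on the transport's waitqueue, as it does today. */
static struct rpc_rqst *xprt_alloc_slot(struct rpc_xprt *xprt)
{
	if (xprt->reserve_used < MIN_SLOTS)
		return &xprt->reserve[xprt->reserve_used++];
	if (xprt->dynamic_limit &&
	    xprt->dynamic_alloced >= xprt->dynamic_limit)
		return NULL;	/* models a GFP_NOWAIT allocation failure */
	xprt->dynamic_alloced++;
	return calloc(1, sizeof(struct rpc_rqst));
}
```

The point of the two tiers is that the first MIN_SLOTS requests can never fail, so forward progress is guaranteed even when the opportunistic allocations are refused.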
On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> mempool with a minimum number of slots. Have them all be allocated with
> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> the waitqueue like it does today. Then, the clients can allocate
> rpc_rqst's as they need as long as memory holds out for it.
>
> We have the reserve_xprt stuff to handle congestion control anyway so I
> don't really see the value in the artificial limits that the slot table
> provides.
>
> Maybe I should hack up a patchset for this...

This issue has come up several times recently. My preference would be to
tie the availability of slots to the TCP window size, and basically say
that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
off allocating more slots until we get a ->write_space() callback which
clears that flag.

For the RDMA case, we can continue to use the current system of a fixed
number of preallocated slots.

Cheers
  Trond
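Trond's proposal can be modeled in a few lines of userspace C. This is a sketch of the control flow only, not kernel code: the `nospace` field stands in for the socket's SOCK_ASYNC_NOSPACE flag, and `xprt_write_space()` models the ->write_space() callback that the socket layer would invoke when send-buffer space frees up:

```c
#include <stdbool.h>

struct xprt_sock {
	bool nospace;		/* models SOCK_ASYNC_NOSPACE on the socket */
	int slots_in_use;
};

/* Hand out a new slot only while the socket can still take data.
 * Returns the slot index, or -1 if the caller must queue up and wait
 * for the write_space callback. */
static int xprt_get_slot(struct xprt_sock *xs)
{
	if (xs->nospace)
		return -1;
	return xs->slots_in_use++;
}

/* Socket layer calls this when send-buffer space becomes available,
 * clearing the flag and letting slot allocation resume. */
static void xprt_write_space(struct xprt_sock *xs)
{
	xs->nospace = false;
}
```

The effect is that the slot count grows only as fast as the TCP window allows the data to actually be transmitted, rather than being a fixed, administrator-tuned constant.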
On Apr 13, 2011, at 10:22 AM, Trond Myklebust wrote:
> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>> [...]
>>
>> Maybe I should hack up a patchset for this...
>
> This issue has come up several times recently. My preference would be to
> tie the availability of slots to the TCP window size, and basically say
> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> off allocating more slots until we get a ->write_space() callback which
> clears that flag.

I am scoping the dynamic rpc_slot allocation work and plan to prototype.

-->Andy

> For the RDMA case, we can continue to use the current system of a fixed
> number of preallocated slots.
On Wed, 13 Apr 2011 10:22:13 -0400
Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>> [...]
>
> This issue has come up several times recently. My preference would be to
> tie the availability of slots to the TCP window size, and basically say
> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> off allocating more slots until we get a ->write_space() callback which
> clears that flag.
>
> For the RDMA case, we can continue to use the current system of a fixed
> number of preallocated slots.

I take it then that we'd want a similar scheme for UDP as well? I guess
I'm just not sure what the slot table is supposed to be for.

Possibly naive question, and maybe you or Andy have scoped this out
already...

Wouldn't it make more sense to allow the code to allocate rpc_rqsts as
needed, and manage congestion control in reserve_xprt? It appears that
that is at least what xprt_reserve_xprt_cong is supposed to do. The TCP
variant (xprt_reserve_xprt) doesn't do that currently, but we could do
it there, and that would seem to make for more parity between TCP and
UDP in this sense.

We could do that similarly for RDMA too. Simply keep track of how many
RPCs are in flight, and only allow reserving the xprt when that number
hasn't crossed the max number of slots...
On Wed, 2011-04-13 at 13:20 -0400, Jeff Layton wrote:
> [...]
>
> I take it then that we'd want a similar scheme for UDP as well? I guess
> I'm just not sure what the slot table is supposed to be for.

No. I don't see UDP as having much of a future in 10GigE and other high
bandwidth environments (or even in 1GigE setups). Let's just leave it as
it is...

> Possibly naive question, and maybe you or Andy have scoped this out
> already...
>
> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
> needed, and manage congestion control in reserve_xprt? It appears that
> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
> it there and that would seem to make for more parity between the TCP
> and UDP in this sense.
>
> We could do that similarly for RDMA too. Simply keep track of how many
> RPCs are in flight and only allow reserving the xprt when that number
> hasn't crossed the max number of slots...

What is the point of allocating a lot of resources when you lack the
bandwidth to do anything with them? The reason for tying this to the TCP
window size is to try to queue up as much data as we can possibly
transmit, without eating too much out of the same GFP_ATOMIC pool of
memory that the networking layer also uses.
On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
> On Wed, 13 Apr 2011 10:22:13 -0400
> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
>
>> [...]
>>
>> This issue has come up several times recently. My preference would be to
>> tie the availability of slots to the TCP window size, and basically say
>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>> off allocating more slots until we get a ->write_space() callback which
>> clears that flag.
>>
>> For the RDMA case, we can continue to use the current system of a fixed
>> number of preallocated slots.
>
> I take it then that we'd want a similar scheme for UDP as well? I guess
> I'm just not sure what the slot table is supposed to be for.

[andros] I look at the rpc_slot table as a representation of the amount
of data the connection to the server can handle - basically, #slots
should equal double the bandwidth-delay product divided by
max(rsize, wsize). For TCP, this is the window size (ping time of a
max-MTU packet * interface bandwidth). There is no reason to allocate
more rpc_rqsts than can fit on the wire.

> Possibly naive question, and maybe you or Andy have scoped this out
> already...
>
> Wouldn't it make more sense to allow the code to allocate rpc_rqsts as
> needed, and manage congestion control in reserve_xprt?

[andros] Congestion control is not what the rpc_slot table is managing.
It does need to have a minimum, which experience has set at 16. It's the
maximum that needs to be dynamic. Congestion control by the lower layers
should work unfettered within the # of rpc_slots. Today that is not
always the case, when 16 slots is not enough to fill the wire and the
administrator has not changed the # of rpc_slots.
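Andy's rule of thumb can be written out as arithmetic. The sketch below assumes units (bytes/sec, microseconds, bytes) and a helper name of its own choosing; real code would have to derive the bandwidth and RTT from the live connection rather than take them as parameters:

```c
#define MIN_SLOTS 16	/* the experience-derived minimum Andy mentions */

/* #slots ~= 2 * (bandwidth * RTT) / max(rsize, wsize), floored at the
 * traditional minimum of 16 slots.
 * bw_bytes_per_sec: link bandwidth in bytes/sec
 * rtt_usec:         round-trip time in microseconds
 * max_io_size:      max(rsize, wsize) in bytes */
static unsigned long nr_slots(unsigned long long bw_bytes_per_sec,
			      unsigned long rtt_usec,
			      unsigned long max_io_size)
{
	unsigned long long bdp = bw_bytes_per_sec * rtt_usec / 1000000ULL;
	unsigned long slots = (unsigned long)(2 * bdp / max_io_size);

	return slots < MIN_SLOTS ? MIN_SLOTS : slots;
}
```

For example, a 10GigE link (~1.25e9 bytes/sec) with a 10ms RTT and a 1MB wsize works out to a bandwidth-delay product of 12.5MB and roughly 23 slots, while a GigE link with a 1ms RTT and 64KB wsize computes to only 3 and is floored at the 16-slot minimum, illustrating Andy's point that the default of 16 can be too small to fill a fat pipe.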
On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
> On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
>
>> I take it then that we'd want a similar scheme for UDP as well? I guess
>> I'm just not sure what the slot table is supposed to be for.
>
> [andros] I look at the rpc_slot table as a representation of the amount
> of data the connection to the server can handle - basically, #slots
> should equal double the bandwidth-delay product divided by
> max(rsize, wsize). For TCP, this is the window size (ping time of a
> max-MTU packet * interface bandwidth). There is no reason to allocate
> more rpc_rqsts than can fit on the wire.

Agreed, but as I said earlier, there is no reason to even try to use UDP
on high bandwidth links, so I suggest we just leave it as-is.

> [andros] Congestion control is not what the rpc_slot table is managing.
> It does need to have a minimum, which experience has set at 16. It's the
> maximum that needs to be dynamic. Congestion control by the lower layers
> should work unfettered within the # of rpc_slots. Today that is not
> always the case, when 16 slots is not enough to fill the wire and the
> administrator has not changed the # of rpc_slots.

Agreed. However, we do need to ensure that the networking layer is aware
that we have more data to send, and that it should negotiate a window
size increase with the server if possible. To do that, we need to
allocate just enough slots to hit the 'SOCK_ASYNC_NOSPACE' limit and
then wait.
On Apr 13, 2011, at 2:14 PM, Trond Myklebust wrote:
> On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
>> [...]
>>
>> [andros] I look at the rpc_slot table as a representation of the amount
>> of data the connection to the server can handle - basically, #slots
>> should equal double the bandwidth-delay product divided by
>> max(rsize, wsize). For TCP, this is the window size (ping time of a
>> max-MTU packet * interface bandwidth). There is no reason to allocate
>> more rpc_rqsts than can fit on the wire.
>
> Agreed, but as I said earlier, there is no reason to even try to use UDP
> on high bandwidth links, so I suggest we just leave it as-is.

I think Jeff is suggesting that all the transports should use the same
logic, but that UDP and RDMA should simply have fixed upper limits on
their slot table size. UDP would then behave the same as before, but
would share code with the others. That might be cleaner than maintaining
separate slot allocation mechanisms for each transport.

In other words, share the code, but parametrize it so that UDP and RDMA
have effectively fixed slot tables as before, while TCP is allowed to
expand.

My two pfennigs' worth.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Wed, 13 Apr 2011 14:47:05 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> [...]
>
> I think Jeff is suggesting that all the transports should use the same
> logic, but UDP and RDMA should simply have fixed upper limits on their
> slot table size. UDP would then behave the same as before, but would
> share code with the others. That might be cleaner than maintaining
> separate slot allocation mechanisms for each transport.
>
> In other words, share the code, but parametrize it so that UDP and RDMA
> have effectively fixed slot tables as before, but TCP is allowed to
> expand.

That was my initial thought, but Trond has a point that there's no
reason to allocate info for a call that we're not able to send. The idea
of hooking up congestion feedback from the networking layer into the
slot allocation code sounds intriguing, so for now I'll stop armchair
quarterbacking and just wait to see what Andy comes up with :)
>>> This issue has come up several times recently. My preference would be to
>>> tie the availability of slots to the TCP window size, and basically say
>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>> off allocating more slots until we get a ->write_space() callback which
>>> clears that flag.
>>>
>>> For the RDMA case, we can continue to use the current system of a fixed
>>> number of preallocated slots.
>>>
>> I take it then that we'd want a similar scheme for UDP as well? I guess
>> I'm just not sure what the slot table is supposed to be for.
> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> can handle - basically the #slots should = double the bandwidth-delay product divided by max(rsize, wsize).
> For TCP, this is the window size (ping of max MTU packet * interface bandwidth).
> There is no reason to allocate more rpc_rqsts than can fit on the wire.

I agree with checking for space on the link.

The above formula is a good lower bound on the maximum number of slots,
but there are many times when a client could use more slots than the
formula suggests. For example, we don't want to punish writes if rsize >
wsize. Also, you have to account for the server memory, which can
sometimes hold several write requests while waiting for them to be
sync'd to disk, leaving the TCP buffers less than full.

Also, I think any solution should allow admins to limit the maximum
number of slots. Too many slots can increase request randomness at the
server, and sometimes severely reduce performance.

Dean

>> Possibly naive question, and maybe you or Andy have scoped this out
>> already...
>>
>> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
>> needed, and manage congestion control in reserve_xprt?
> [andros] Congestion control is not what the rpc_slot table is managing. It does need to have
> a minimum, which experience has set at 16. It's the maximum that needs to be dynamic.
> Congestion control by the lower layers should work unfettered within the # of rpc_slots. Today that
> is not always the case, when 16 slots is not enough to fill the wire and the administrator has
> not changed the # of rpc_slots.
>
>> It appears that that, at least, is what xprt_reserve_xprt_cong is
>> supposed to do. The TCP variant (xprt_reserve_xprt) doesn't do that
>> currently, but we could do it there and that would seem to make for
>> more parity between TCP and UDP in this sense.
>>
>> We could do that similarly for RDMA too. Simply keep track of how many
>> RPCs are in flight and only allow reserving the xprt when that number
>> hasn't crossed the max number of slots...
>
>> --
>> Jeff Layton <jlayton@redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
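Andy's sizing rule (#slots = 2 * bandwidth-delay product / max(rsize, wsize), with the experience-based floor of 16) works out as follows. This is a standalone arithmetic sketch; the function name and the example figures are mine, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative restatement of the rule of thumb from this thread:
 *   #slots = 2 * (bandwidth-delay product) / max(rsize, wsize)
 * clamped to a floor of 16 (the long-standing default minimum).
 */
static unsigned int suggested_slots(uint64_t bytes_per_sec, uint64_t rtt_usec,
				    uint32_t rsize, uint32_t wsize)
{
	uint64_t bdp = bytes_per_sec * rtt_usec / 1000000;	/* bytes in flight */
	uint32_t io = rsize > wsize ? rsize : wsize;		/* largest RPC payload */
	uint64_t slots = 2 * bdp / io;

	return slots < 16 ? 16 : (unsigned int)slots;
}
```

For example, gigabit Ethernet (~125 MB/s) at 1 ms RTT with a 1 MB wsize never needs more than the 16-slot floor, while a 10 GbE link at 10 ms RTT works out to roughly two dozen slots, which is exactly the case Andy describes where the fixed 16 cannot fill the wire.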
On Wed, 2011-04-13 at 17:21 -0700, Dean wrote:
> >>> This issue has come up several times recently. My preference would be to
> >>> tie the availability of slots to the TCP window size, and basically say
> >>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> >>> off allocating more slots until we get a ->write_space() callback which
> >>> clears that flag.
> >>>
> >>> For the RDMA case, we can continue to use the current system of a fixed
> >>> number of preallocated slots.
> >>>
> >> I take it then that we'd want a similar scheme for UDP as well? I guess
> >> I'm just not sure what the slot table is supposed to be for.
> > [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> > can handle - basically the #slots should = double the bandwidth-delay product divided by max(rsize, wsize).
> > For TCP, this is the window size (ping of max MTU packet * interface bandwidth).
> > There is no reason to allocate more rpc_rqsts than can fit on the wire.
> I agree with checking for space on the link.
>
> The above formula is a good lower bound on the maximum number of slots,
> but there are many times when a client could use more slots than the
> formula suggests. For example, we don't want to punish writes if rsize >
> wsize. Also, you have to account for the server memory, which can
> sometimes hold several write requests while waiting for them to be
> sync'd to disk, leaving the TCP buffers less than full.

Err... No... On the contrary, it is a good _upper_ bound on the number
of slots. There is no point in allocating a slot for an RPC request
which you know you have no ability to transmit. That has nothing to do
with rsize or wsize values: if the socket is backed up, it won't take
more data.

Trond
On 4/13/11 5:42 PM, Trond Myklebust wrote:
> On Wed, 2011-04-13 at 17:21 -0700, Dean wrote:
>>>>> This issue has come up several times recently. My preference would be to
>>>>> tie the availability of slots to the TCP window size, and basically say
>>>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>>>> off allocating more slots until we get a ->write_space() callback which
>>>>> clears that flag.
>>>>>
>>>>> For the RDMA case, we can continue to use the current system of a fixed
>>>>> number of preallocated slots.
>>>>>
>>>> I take it then that we'd want a similar scheme for UDP as well? I guess
>>>> I'm just not sure what the slot table is supposed to be for.
>>> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
>>> can handle - basically the #slots should = double the bandwidth-delay product divided by max(rsize, wsize).
>>> For TCP, this is the window size (ping of max MTU packet * interface bandwidth).
>>> There is no reason to allocate more rpc_rqsts than can fit on the wire.
>> I agree with checking for space on the link.
>>
>> The above formula is a good lower bound on the maximum number of slots,
>> but there are many times when a client could use more slots than the
>> formula suggests. For example, we don't want to punish writes if rsize >
>> wsize. Also, you have to account for the server memory, which can
>> sometimes hold several write requests while waiting for them to be
>> sync'd to disk, leaving the TCP buffers less than full.
> Err... No... On the contrary, it is a good _upper_ bound on the number
> of slots. There is no point in allocating a slot for an RPC request
> which you know you have no ability to transmit. That has nothing to do
> with rsize or wsize values: if the socket is backed up, it won't take
> more data.

Absolutely, I'm just trying to point out that checking the
SOCK_ASYNC_NOSPACE flag seems to be the only way to guarantee it won't
take more data.

Dean
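The gating scheme discussed here (SOCK_ASYNC_NOSPACE and ->write_space() are real kernel socket mechanisms) can be modeled in user space roughly as follows. Everything below — fake_xprt and both function names — is a hypothetical analogy, not the sunrpc implementation:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space analogy of the proposal: treat "a slot is available" as
 * "the transmit path can take another request". The nospace field
 * stands in for the kernel's SOCK_ASYNC_NOSPACE socket flag.
 */
struct fake_xprt {
	bool nospace;			/* mirrors SOCK_ASYNC_NOSPACE */
	unsigned int slots_in_use;
	unsigned int max_slots;
};

/* Hand out a slot only when the transport reports room on the wire. */
static int xprt_try_alloc_slot(struct fake_xprt *xprt)
{
	if (xprt->nospace || xprt->slots_in_use >= xprt->max_slots)
		return -1;	/* caller waits until write_space fires */
	xprt->slots_in_use++;
	return 0;
}

/* Modeled on the socket's ->write_space() callback clearing the flag. */
static void fake_write_space(struct fake_xprt *xprt)
{
	xprt->nospace = false;
}
```

The point of the pattern is that allocation never races ahead of the socket: while nospace is set, no new slot is granted, and the write_space notification is the only thing that reopens the gate.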
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 398f8ed..d2ed659 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	dreq = nfs_direct_req_alloc();
 	if (!dreq)
 		goto out;
 
-	nfs_alloc_commit_data(dreq);
-	if (dreq->commit_data == NULL || count <= wsize)
+	if (count > wsize || nr_segs > 1)
+		nfs_alloc_commit_data(dreq);
+	else
+		dreq->commit_data = NULL;
+
+	/*
+	 * If we couldn't allocate commit data, or we'll just be doing a
+	 * single write, then make this a NFS_FILE_SYNC write and do away
+	 * with the commit.
+	 */
+	if (dreq->commit_data == NULL)
 		sync = NFS_FILE_SYNC;
 
 	dreq->inode = inode;
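For what it's worth, the commit-data policy this hunk introduces reduces to a small decision function. The following is a standalone restatement for illustration only (choose_sync() and its signature are mine, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Local stand-ins for the NFS stable_how values, for illustration. */
enum stable_how { NFS_UNSTABLE, NFS_FILE_SYNC };

/*
 * Restatement of the patch's logic: commit data is allocated only when
 * the request will span multiple RPCs (count > wsize) or multiple iovec
 * segments. A single write that fits in one RPC, or any case where the
 * commit-data allocation fails, is sent as one NFS_FILE_SYNC RPC with
 * no separate COMMIT step.
 */
static enum stable_how choose_sync(size_t count, size_t wsize,
				   unsigned long nr_segs, bool commit_alloc_ok)
{
	bool want_commit = (count > wsize || nr_segs > 1) && commit_alloc_ok;

	return want_commit ? NFS_UNSTABLE : NFS_FILE_SYNC;
}
```

Note how this addresses the 64 x 4K FILE_SYNC case from the start of the thread: with nr_segs > 1 the writes become UNSTABLE followed by a single COMMIT, instead of 64 synchronous 4K writes at the server.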