Message ID | 1422477777-27933-3-git-send-email-Anna.Schumaker@Netapp.com (mailing list archive) |
---|---|
State | New, archived |
> +	p = xdr_reserve_space(xdr, 4 + 8 + 4); /* content type, offset, maxcount */
> +	if (!p)
> +		return nfserr_resource;
> +	xdr_commit_encode(xdr);
> +
> +	maxcount = svc_max_payload(resp->rqstp);
> +	maxcount = min_t(unsigned long, maxcount, (xdr->buf->buflen - xdr->buf->len));
> +	maxcount = min_t(unsigned long, maxcount, read->rd_length);
> +
> +	err = nfsd4_encode_readv(resp, read, file, &maxcount);

If the READ_PLUS implementation can't use the splice read path it's
probably a loss for most workloads.
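For context, the splice read path referred to here is the zero-copy path the existing READ encoder already prefers. A simplified sketch of that choice, modeled loosely on nfsd4_encode_read() in fs/nfsd/nfs4xdr.c (the splice-ok test and the helper signatures are approximations, not the exact upstream code):

/*
 * Simplified sketch of how the existing READ encoder picks its path
 * (approximate; based on nfsd4_encode_read()).
 */
static __be32 encode_read_payload(struct nfsd4_compoundres *resp,
				  struct nfsd4_read *read,
				  struct file *file, unsigned long maxcount)
{
	/*
	 * Splice path: hand the page cache pages holding the data straight
	 * to the xdr_buf's page array, so the payload is never copied.
	 * Only works for a single contiguous payload in the reply.
	 */
	if (file->f_op->splice_read && resp->rqstp->rq_splice_ok)
		return nfsd4_encode_splice_read(resp, read, file, maxcount);

	/*
	 * readv path: copy the data into reserved xdr space.  Slower, but
	 * it can be invoked repeatedly, once per READ_PLUS segment.
	 */
	return nfsd4_encode_readv(resp, read, file, &maxcount);
}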
On 02/05/2015 09:13 AM, Christoph Hellwig wrote:
>> +	p = xdr_reserve_space(xdr, 4 + 8 + 4); /* content type, offset, maxcount */
>> +	if (!p)
>> +		return nfserr_resource;
>> +	xdr_commit_encode(xdr);
>> +
>> +	maxcount = svc_max_payload(resp->rqstp);
>> +	maxcount = min_t(unsigned long, maxcount, (xdr->buf->buflen - xdr->buf->len));
>> +	maxcount = min_t(unsigned long, maxcount, read->rd_length);
>> +
>> +	err = nfsd4_encode_readv(resp, read, file, &maxcount);
>
> If the READ_PLUS implementation can't use the splice read path it's
> probably a loss for most workloads.
>

I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments.
On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote:
> > If the READ_PLUS implementation can't use the splice read path it's
> > probably a loss for most workloads.
> >
>
> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments.

The problem is that the typical case of all data won't use splice
every with your patches as the 4.2 client will always send a READ_PLUS.

So we'll have to find a way to use it where it helps.  While we might be
able to add some hacks to only use splice for the first segment I guess
we just need to make the splice support generic enough in the long run.
On 02/05/2015 11:23 AM, Christoph Hellwig wrote:
> On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote:
>>> If the READ_PLUS implementation can't use the splice read path it's
>>> probably a loss for most workloads.
>>>
>>
>> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments.
>
> The problem is that the typical case of all data won't use splice
> every with your patches as the 4.2 client will always send a READ_PLUS.
>
> So we'll have to find a way to use it where it helps.  While we might be
> able to add some hacks to only use splice for the first segment I guess
> we just need to make the splice support generic enough in the long run.
>

I should be able to use splice if I detect that we're only returning a single DATA segment easily enough.
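A minimal sketch of that single-DATA-segment detection, assuming the underlying file system supports SEEK_HOLE (the helper name is hypothetical, and a real version would probably reuse whatever hole map the encoder has already computed):

/*
 * Hypothetical helper: return true when [offset, offset + count) contains
 * no hole, i.e. the reply would be a single DATA segment and the splice
 * path is safe.  SEEK_HOLE reports the implicit hole at EOF, so this is
 * conservative for reads that cross the end of file; a negative return
 * (no SEEK_HOLE support, or offset beyond EOF) is treated as plain data.
 */
static bool read_plus_one_data_segment(struct file *file, loff_t offset,
				       unsigned long count)
{
	loff_t next_hole = vfs_llseek(file, offset, SEEK_HOLE);

	if (next_hole < 0)
		return true;

	return next_hole >= offset + (loff_t)count;
}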
On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote:
> On 02/05/2015 09:13 AM, Christoph Hellwig wrote:
> >> +	p = xdr_reserve_space(xdr, 4 + 8 + 4); /* content type, offset, maxcount */
> >> +	if (!p)
> >> +		return nfserr_resource;
> >> +	xdr_commit_encode(xdr);
> >> +
> >> +	maxcount = svc_max_payload(resp->rqstp);
> >> +	maxcount = min_t(unsigned long, maxcount, (xdr->buf->buflen - xdr->buf->len));
> >> +	maxcount = min_t(unsigned long, maxcount, read->rd_length);
> >> +
> >> +	err = nfsd4_encode_readv(resp, read, file, &maxcount);
> >
> > If the READ_PLUS implementation can't use the splice read path it's
> > probably a loss for most workloads.
> >
>
> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments.

The advantage of splice is it can hand us references to page cache pages
that already hold the read data, which we can in turn hand off to the
network layer when we send the reply, minimizing copying.

But then we need a data structure that can keep track of all the pieces
of the resulting reply, which consists of the read data plus any
protocol xdr that surrounds the read data.

The xdr_buf represents this fine: head, (pages[], page_base, page_len),
tail.

If you want to do zero-copy READ_PLUS then you need a data structure
that will represent a reply with multiple data segments.  I guess you
could add more fields to the xdr_buf, or leave it alone and keep a list
of xdr_buf's instead.  The latter sounds better to me, and I imagine
that would be doable but tedious--but I haven't honestly thought it
through.

There's also a tradeoff on the client side, isn't there?  Worst case if
the data to be read is a small hole followed by a bunch of data, then a
READ_PLUS would end up needing to copy all the data in return for only a
small savings in data transferred.

I feel like READ_PLUS needs more of an argument.

--b.
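For reference, the xdr_buf described above looks roughly like this (abbreviated from include/linux/sunrpc/xdr.h of that era, with a couple of fields omitted). The key point is that a reply carries exactly one run of page data between the head and tail kvecs, which is why a single splice-able data segment fits naturally and several do not:

/*
 * Roughly the xdr_buf layout described above (abbreviated; see
 * include/linux/sunrpc/xdr.h for the full definition).
 */
struct xdr_buf {
	struct kvec	head[1];	/* RPC header and leading XDR */
	struct kvec	tail[1];	/* XDR appended after page data */

	struct page	**pages;	/* the single run of page data */
	unsigned int	page_base;	/* offset of data within first page */
	unsigned int	page_len;	/* length of page data */

	unsigned int	buflen;		/* total space available */
	unsigned int	len;		/* length of encoded message so far */
};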
On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
> On 02/05/2015 11:23 AM, Christoph Hellwig wrote:
> > On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote:
> >>> If the READ_PLUS implementation can't use the splice read path it's
> >>> probably a loss for most workloads.
> >>>
> >>
> >> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments.
> >
> > The problem is that the typical case of all data won't use splice
> > every with your patches as the 4.2 client will always send a READ_PLUS.
> >
> > So we'll have to find a way to use it where it helps.  While we might be
> > able to add some hacks to only use splice for the first segment I guess
> > we just need to make the splice support generic enough in the long run.
> >
>
> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough.

Oh, good thought, yes maybe that would be enough.

But I still wish he had more evidence here.

--b.
On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
> > The problem is that the typical case of all data won't use splice
> > every with your patches as the 4.2 client will always send a READ_PLUS.
> >
> > So we'll have to find a way to use it where it helps.  While we might be
> > able to add some hacks to only use splice for the first segment I guess
> > we just need to make the splice support generic enough in the long run.
> >
>
> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough.

You could also elect to never return more than one data segment as a
start:

    In all situations, the
    server may choose to return fewer bytes than specified by the client.
    The client needs to check for this condition and handle the
    condition appropriately.

But doing any of these for a call that's really just an optimization
soudns odd.  I'd really like to see an evaluation of the READ_PLUS
impact on various workloads before offering it.
On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
> > > The problem is that the typical case of all data won't use splice
> > > every with your patches as the 4.2 client will always send a READ_PLUS.
> > >
> > > So we'll have to find a way to use it where it helps.  While we might be
> > > able to add some hacks to only use splice for the first segment I guess
> > > we just need to make the splice support generic enough in the long run.
> > >
> >
> > I should be able to use splice if I detect that we're only returning a single DATA segment easily enough.
>
> You could also elect to never return more than one data segment as a
> start:
>
>     In all situations, the
>     server may choose to return fewer bytes than specified by the client.
>     The client needs to check for this condition and handle the
>     condition appropriately.

Yeah, I think that was more-or-less what Anna's first attempt did and I
said "what if that means more round trips"?  The client can't anticipate
the short reads so it can't make up for this with parallelism.

> But doing any of these for a call that's really just an optimization
> soudns odd.  I'd really like to see an evaluation of the READ_PLUS
> impact on various workloads before offering it.

Yes, unfortunately I don't see a way to make this just an obvious win.

(Is there any way we could make it so with better protocol?  Maybe RDMA
could help get the alignment right in multiple-segment cases?  But then
I think there needs to be some sort of language about RDMA, or else
we're stuck with:

	https://tools.ietf.org/html/rfc5667#section-5

which I think forces us to return READ_PLUS data inline, another
possible READ_PLUS regression.)

Anyway, ignoring RDMA, I guess the interesting questions to me are:

	- How much does READ_PLUS actually help in the cases it was
	  meant to?  Perhaps create and age a VM image somehow, and then
	  time a read of the entire image?
	- What are the worst-case regressions?  Read a large file with a
	  lot of small holes and measure time and cpu usage?

--b.
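As a sketch of the worst case mentioned above, something like the following builds a large file that alternates small data blocks with small holes; the path and sizes are arbitrary choices. Reading it back with and without READ_PLUS, and watching elapsed time and CPU, would be one way to measure the per-segment encoding overhead:

/*
 * Illustrative only: build a ~1GB file that alternates 4KB of data with
 * a 4KB hole, as a worst case for per-segment READ_PLUS encoding.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char block[4096];
	off_t size = 1024 * 1024 * 1024L;
	off_t pos;
	int fd;

	fd = open("/export/readplus-worst-case", O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0)
		return 1;
	memset(block, 0xab, sizeof(block));

	/* Write 4KB, skip 4KB; the skipped ranges become holes. */
	for (pos = 0; pos < size; pos += 2 * sizeof(block))
		if (pwrite(fd, block, sizeof(block), pos) != (ssize_t)sizeof(block))
			return 1;

	/* Extend past the last write so the final gap is a hole, not EOF. */
	if (ftruncate(fd, pos) < 0)
		return 1;
	return close(fd) < 0;
}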
On Fri, Feb 06, 2015 at 11:08:48AM -0500, J. Bruce Fields wrote: > On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: > > On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: > > > > The problem is that the typical case of all data won't use splice > > > > every with your patches as the 4.2 client will always send a READ_PLUS. > > > > > > > > So we'll have to find a way to use it where it helps. While we might be > > > > able to add some hacks to only use splice for the first segment I guess > > > > we just need to make the splice support generic enough in the long run. > > > > > > > > > > I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. > > > > You could also elect to never return more than one data segment as a > > start: > > > > In all situations, the > > server may choose to return fewer bytes than specified by the client. > > The client needs to check for this condition and handle the > > condition appropriately. > > Yeah, I think that was more-or-less what Anna's first attempt did and I > said "what if that means more round trips"? The client can't anticipate > the short reads so it can't make up for this with parallelism. > > > But doing any of these for a call that's really just an optimization > > soudns odd. I'd really like to see an evaluation of the READ_PLUS > > impact on various workloads before offering it. > > Yes, unfortunately I don't see a way to make this just an obvious win. Actually, the other fallback is to return a single data segment, which shouldn't be any worse than READ as long as the client assumes that's the normal case. So as a first pass another idea might be to return a hole *only* if it covers the entire requested range and otherwise represent the result as a single data segment? (Or you could imagine various ways to make it more complicated: e.g. also allow 1 data segment + 1 hole, also representable with an xdr buf and without screwing up the client's alignment expectations. Or allow additional data segments if they're small enough that we think the copying cost would be negligible.) And turn off READ_PLUS over RDMA, but that seems like something that should be fixed in the spec. --b. > > (Is there any way we could make it so with better protocol? Maybe RDMA > could help get the alignment right in multiple-segment cases? But then > I think there needs to be some sort of language about RDMA, or else > we're stuck with: > > https://tools.ietf.org/html/rfc5667#section-5 > > which I think forces us to return READ_PLUS data inline, another > possible READ_PLUS regression.) > > Anyway, ignoring RDMA, I guess the interesting questions to me are: > > - How much does READ_PLUS actually help in the cases it was > meant to? Perhaps create and age a VM image somehow, and then > time a read of the entire image? > - What are the worst-case regressions? Read a large file with a > lot of small holes and measure time and cpu usage? > > --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
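A sketch of the basic form of that fallback, replying with at most one segment; encode_hole_segment() and encode_data_segment() are hypothetical placeholders for the two encoders, not existing nfsd code:

/*
 * Sketch only: reply with at most one segment.  If SEEK_DATA finds no
 * data anywhere in [rd_offset, rd_offset + maxcount), encode one HOLE
 * segment; otherwise encode the whole range as one DATA segment
 * (interior holes are simply read back as zeroes, exactly as READ would
 * return them), which keeps the reply splice-able.
 */
static __be32 nfsd4_encode_one_read_plus_segment(struct nfsd4_compoundres *resp,
						 struct nfsd4_read *read,
						 struct file *file,
						 unsigned long maxcount)
{
	loff_t start = read->rd_offset;
	loff_t next_data = vfs_llseek(file, start, SEEK_DATA);

	if (next_data == -ENXIO || next_data >= start + (loff_t)maxcount)
		return encode_hole_segment(resp, start, maxcount);

	/* Data somewhere in the range (or SEEK_DATA unsupported). */
	return encode_data_segment(resp, read, file, maxcount);
}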
On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: >> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >>>> The problem is that the typical case of all data won't use splice >>>> every with your patches as the 4.2 client will always send a READ_PLUS. >>>> >>>> So we'll have to find a way to use it where it helps. While we might be >>>> able to add some hacks to only use splice for the first segment I guess >>>> we just need to make the splice support generic enough in the long run. >>>> >>> >>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. >> >> You could also elect to never return more than one data segment as a >> start: >> >> In all situations, the >> server may choose to return fewer bytes than specified by the client. >> The client needs to check for this condition and handle the >> condition appropriately. > > Yeah, I think that was more-or-less what Anna's first attempt did and I > said "what if that means more round trips"? The client can't anticipate > the short reads so it can't make up for this with parallelism. > >> But doing any of these for a call that's really just an optimization >> soudns odd. I'd really like to see an evaluation of the READ_PLUS >> impact on various workloads before offering it. > > Yes, unfortunately I don't see a way to make this just an obvious win. I don’t think a “win” is necessary. It simply needs to be no worse than READ for current use cases. READ_PLUS should be a win for the particular use cases it was designed for (large sparsely-populated datasets). Without a demonstrated benefit I think there’s no point in keeping it. > (Is there any way we could make it so with better protocol? Maybe RDMA > could help get the alignment right in multiple-segment cases? But then > I think there needs to be some sort of language about RDMA, or else > we're stuck with: > > https://tools.ietf.org/html/rfc5667#section-5 > > which I think forces us to return READ_PLUS data inline, another > possible READ_PLUS regression.) NFSv4.2 currently does not have a binding to RPC/RDMA. It’s hard to say at this point what a READ_PLUS on RPC/RDMA might look like. RDMA clearly provides no advantage for moving a pattern that a client must re-inflate into data itself. I can guess that only the CONTENT_DATA case is interesting for RDMA bulk transfers. But don’t forget that NFSv4.1 and later don’t yet work over RDMA, thanks to missing support for bi-directional RPC/RDMA. I wouldn’t worry about special cases for it at this point. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote: > > On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > >> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: >>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >>>>> The problem is that the typical case of all data won't use splice >>>>> every with your patches as the 4.2 client will always send a READ_PLUS. >>>>> >>>>> So we'll have to find a way to use it where it helps. While we might be >>>>> able to add some hacks to only use splice for the first segment I guess >>>>> we just need to make the splice support generic enough in the long run. >>>>> >>>> >>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. >>> >>> You could also elect to never return more than one data segment as a >>> start: >>> >>> In all situations, the >>> server may choose to return fewer bytes than specified by the client. >>> The client needs to check for this condition and handle the >>> condition appropriately. >> >> Yeah, I think that was more-or-less what Anna's first attempt did and I >> said "what if that means more round trips"? The client can't anticipate >> the short reads so it can't make up for this with parallelism. >> >>> But doing any of these for a call that's really just an optimization >>> soudns odd. I'd really like to see an evaluation of the READ_PLUS >>> impact on various workloads before offering it. >> >> Yes, unfortunately I don't see a way to make this just an obvious win. > > I don’t think a “win” is necessary. It simply needs to be no worse than > READ for current use cases. > > READ_PLUS should be a win for the particular use cases it was > designed for (large sparsely-populated datasets). Without a > demonstrated benefit I think there’s no point in keeping it. > >> (Is there any way we could make it so with better protocol? Maybe RDMA >> could help get the alignment right in multiple-segment cases? But then >> I think there needs to be some sort of language about RDMA, or else >> we're stuck with: >> >> https://tools.ietf.org/html/rfc5667#section-5 >> >> which I think forces us to return READ_PLUS data inline, another >> possible READ_PLUS regression.) Btw, if I understand this correctly: Without a spec update, a large NFS READ_PLUS reply would be returned in a reply list, which is moved via RDMA WRITE, just like READ replies. The difference is NFS READ payload is placed directly into the client’s page cache by the adapter. With a reply list, the client transport would need to copy the returned data into the page cache. And a large reply buffer would be needed. So, slower, yes. But not inline. > NFSv4.2 currently does not have a binding to RPC/RDMA. Right, this means a spec update is needed. I agree with you, and it’s on our list. > It’s hard to > say at this point what a READ_PLUS on RPC/RDMA might look like. > > RDMA clearly provides no advantage for moving a pattern that a > client must re-inflate into data itself. I can guess that only the > CONTENT_DATA case is interesting for RDMA bulk transfers. > > But don’t forget that NFSv4.1 and later don’t yet work over RDMA, > thanks to missing support for bi-directional RPC/RDMA. I wouldn’t > worry about special cases for it at this point. 
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote: > > On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > >> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: > >>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: > >>>>> The problem is that the typical case of all data won't use splice > >>>>> every with your patches as the 4.2 client will always send a READ_PLUS. > >>>>> > >>>>> So we'll have to find a way to use it where it helps. While we might be > >>>>> able to add some hacks to only use splice for the first segment I guess > >>>>> we just need to make the splice support generic enough in the long run. > >>>>> > >>>> > >>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. > >>> > >>> You could also elect to never return more than one data segment as a > >>> start: > >>> > >>> In all situations, the > >>> server may choose to return fewer bytes than specified by the client. > >>> The client needs to check for this condition and handle the > >>> condition appropriately. > >> > >> Yeah, I think that was more-or-less what Anna's first attempt did and I > >> said "what if that means more round trips"? The client can't anticipate > >> the short reads so it can't make up for this with parallelism. > >> > >>> But doing any of these for a call that's really just an optimization > >>> soudns odd. I'd really like to see an evaluation of the READ_PLUS > >>> impact on various workloads before offering it. > >> > >> Yes, unfortunately I don't see a way to make this just an obvious win. > > > > I don’t think a “win” is necessary. It simply needs to be no worse than > > READ for current use cases. > > > > READ_PLUS should be a win for the particular use cases it was > > designed for (large sparsely-populated datasets). Without a > > demonstrated benefit I think there’s no point in keeping it. > > > >> (Is there any way we could make it so with better protocol? Maybe RDMA > >> could help get the alignment right in multiple-segment cases? But then > >> I think there needs to be some sort of language about RDMA, or else > >> we're stuck with: > >> > >> https://tools.ietf.org/html/rfc5667#section-5 > >> > >> which I think forces us to return READ_PLUS data inline, another > >> possible READ_PLUS regression.) > > Btw, if I understand this correctly: > > Without a spec update, a large NFS READ_PLUS reply would be returned > in a reply list, which is moved via RDMA WRITE, just like READ > replies. > > The difference is NFS READ payload is placed directly into the > client’s page cache by the adapter. With a reply list, the client > transport would need to copy the returned data into the page cache. > And a large reply buffer would be needed. > > So, slower, yes. But not inline. I'm not very good at this, bear with me, but: the above-referenced section doesn't talk about "reply lists", only "write lists", and only explains how to use write lists for READ and READLINK data, and seems to expect everything else to be sent inline. > > NFSv4.2 currently does not have a binding to RPC/RDMA. > > Right, this means a spec update is needed. I agree with you, and > it’s on our list. OK, so that would go in some kind of update to 5667 rather than in the minor version 2 spec? Discussing this in the READ_PLUS description would also seem helpful to me, but OK I don't really have a strong opinion. --b. 
On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote: >> >> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote: >> >>> >>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: >>> >>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: >>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >>>>>>> The problem is that the typical case of all data won't use splice >>>>>>> every with your patches as the 4.2 client will always send a READ_PLUS. >>>>>>> >>>>>>> So we'll have to find a way to use it where it helps. While we might be >>>>>>> able to add some hacks to only use splice for the first segment I guess >>>>>>> we just need to make the splice support generic enough in the long run. >>>>>>> >>>>>> >>>>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. >>>>> >>>>> You could also elect to never return more than one data segment as a >>>>> start: >>>>> >>>>> In all situations, the >>>>> server may choose to return fewer bytes than specified by the client. >>>>> The client needs to check for this condition and handle the >>>>> condition appropriately. >>>> >>>> Yeah, I think that was more-or-less what Anna's first attempt did and I >>>> said "what if that means more round trips"? The client can't anticipate >>>> the short reads so it can't make up for this with parallelism. >>>> >>>>> But doing any of these for a call that's really just an optimization >>>>> soudns odd. I'd really like to see an evaluation of the READ_PLUS >>>>> impact on various workloads before offering it. >>>> >>>> Yes, unfortunately I don't see a way to make this just an obvious win. >>> >>> I don’t think a “win” is necessary. It simply needs to be no worse than >>> READ for current use cases. >>> >>> READ_PLUS should be a win for the particular use cases it was >>> designed for (large sparsely-populated datasets). Without a >>> demonstrated benefit I think there’s no point in keeping it. >>> >>>> (Is there any way we could make it so with better protocol? Maybe RDMA >>>> could help get the alignment right in multiple-segment cases? But then >>>> I think there needs to be some sort of language about RDMA, or else >>>> we're stuck with: >>>> >>>> https://tools.ietf.org/html/rfc5667#section-5 >>>> >>>> which I think forces us to return READ_PLUS data inline, another >>>> possible READ_PLUS regression.) >> >> Btw, if I understand this correctly: >> >> Without a spec update, a large NFS READ_PLUS reply would be returned >> in a reply list, which is moved via RDMA WRITE, just like READ >> replies. >> >> The difference is NFS READ payload is placed directly into the >> client’s page cache by the adapter. With a reply list, the client >> transport would need to copy the returned data into the page cache. >> And a large reply buffer would be needed. >> >> So, slower, yes. But not inline. > > I'm not very good at this, bear with me, but: the above-referenced > section doesn't talk about "reply lists", only "write lists", and only > explains how to use write lists for READ and READLINK data, and seems to expect everything else to be sent inline. I may have some details wrong, but this is my understanding. Small replies are sent inline. There is a size maximum for inline messages, however. I guess 5667 section 5 assumes this context, which appears throughout RFC 5666. 
If an expected reply exceeds the inline size, then a client will set up a reply list for the server. A memory region on the client is registered as a target for RDMA WRITE operations, and the co-ordinates of that region are sent to the server in the RPC call. If the server finds the reply will indeed be larger than the inline maximum, it plants the reply in the client memory region described by the request’s reply list, and repeats the co-ordinates of that region back to the client in the RPC reply. A server may also choose to send a small reply inline, even if the client provided a reply list. In that case, the server does not repeat the reply list in the reply, and the full reply appears inline. Linux registers part of the RPC reply buffer for the reply list. After it is received on the client, the reply payload is copied by the client CPU to its final destination. Inline and reply list are the mechanisms used when the upper layer has some processing to do to the incoming data (eg READDIR). When a request just needs raw data to be simply dropped off in the client’s memory, then the write list is preferred. A write list is basically a zero-copy I/O. But these choices are fixed by the specified RPC/RDMA binding of the upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK are the only NFS operations allowed to use a write list. (NFSv4 compounds are somewhat ambiguous, and that too needs to be addressed). As READ_PLUS conveys both kinds of data (zero-copy and data that might require some processing) IMO RFC 5667 does not provide adequate guidance about how to convey READ_PLUS. It will need to be added somewhere. >>> NFSv4.2 currently does not have a binding to RPC/RDMA. >> >> Right, this means a spec update is needed. I agree with you, and >> it’s on our list. > > OK, so that would go in some kind of update to 5667 rather than in the > minor version 2 spec? The WG has to decide whether an update to 5667 or a new document will be the ultimate vehicle. > Discussing this in the READ_PLUS description would also seem helpful to > me, but OK I don't really have a strong opinion. If there is a precedent, it’s probably that the RPC/RDMA binding is specified in a separate document. I suspect there won’t be much appetite for holding up NFSv4.2 for an RPC/RDMA binding. > --b. > >> >>> It’s hard to >>> say at this point what a READ_PLUS on RPC/RDMA might look like. >>> >>> RDMA clearly provides no advantage for moving a pattern that a >>> client must re-inflate into data itself. I can guess that only the >>> CONTENT_DATA case is interesting for RDMA bulk transfers. >>> >>> But don’t forget that NFSv4.1 and later don’t yet work over RDMA, >>> thanks to missing support for bi-directional RPC/RDMA. I wouldn’t >>> worry about special cases for it at this point. >>> >>> -- >>> Chuck Lever >>> chuck[dot]lever[at]oracle[dot]com >>> >>> >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> -- >> Chuck Lever >> chuck[dot]lever[at]oracle[dot]com >> >> -- Chuck Lever chuck[dot]lever[at]oracle[dot]com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Feb 06, 2015 at 01:44:15PM -0500, Chuck Lever wrote: > > On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote: > >> > >> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote: > >> > >>> > >>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > >>> > >>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: > >>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: > >>>>>>> The problem is that the typical case of all data won't use splice > >>>>>>> every with your patches as the 4.2 client will always send a READ_PLUS. > >>>>>>> > >>>>>>> So we'll have to find a way to use it where it helps. While we might be > >>>>>>> able to add some hacks to only use splice for the first segment I guess > >>>>>>> we just need to make the splice support generic enough in the long run. > >>>>>>> > >>>>>> > >>>>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. > >>>>> > >>>>> You could also elect to never return more than one data segment as a > >>>>> start: > >>>>> > >>>>> In all situations, the > >>>>> server may choose to return fewer bytes than specified by the client. > >>>>> The client needs to check for this condition and handle the > >>>>> condition appropriately. > >>>> > >>>> Yeah, I think that was more-or-less what Anna's first attempt did and I > >>>> said "what if that means more round trips"? The client can't anticipate > >>>> the short reads so it can't make up for this with parallelism. > >>>> > >>>>> But doing any of these for a call that's really just an optimization > >>>>> soudns odd. I'd really like to see an evaluation of the READ_PLUS > >>>>> impact on various workloads before offering it. > >>>> > >>>> Yes, unfortunately I don't see a way to make this just an obvious win. > >>> > >>> I don’t think a “win” is necessary. It simply needs to be no worse than > >>> READ for current use cases. > >>> > >>> READ_PLUS should be a win for the particular use cases it was > >>> designed for (large sparsely-populated datasets). Without a > >>> demonstrated benefit I think there’s no point in keeping it. > >>> > >>>> (Is there any way we could make it so with better protocol? Maybe RDMA > >>>> could help get the alignment right in multiple-segment cases? But then > >>>> I think there needs to be some sort of language about RDMA, or else > >>>> we're stuck with: > >>>> > >>>> https://tools.ietf.org/html/rfc5667#section-5 > >>>> > >>>> which I think forces us to return READ_PLUS data inline, another > >>>> possible READ_PLUS regression.) > >> > >> Btw, if I understand this correctly: > >> > >> Without a spec update, a large NFS READ_PLUS reply would be returned > >> in a reply list, which is moved via RDMA WRITE, just like READ > >> replies. > >> > >> The difference is NFS READ payload is placed directly into the > >> client’s page cache by the adapter. With a reply list, the client > >> transport would need to copy the returned data into the page cache. > >> And a large reply buffer would be needed. > >> > >> So, slower, yes. But not inline. > > > > I'm not very good at this, bear with me, but: the above-referenced > > section doesn't talk about "reply lists", only "write lists", and only > > explains how to use write lists for READ and READLINK data, and seems to expect everything else to be sent inline. > > I may have some details wrong, but this is my understanding. 
> > Small replies are sent inline. There is a size maximum for inline > messages, however. I guess 5667 section 5 assumes this context, which > appears throughout RFC 5666. > > If an expected reply exceeds the inline size, then a client will > set up a reply list for the server. A memory region on the client is > registered as a target for RDMA WRITE operations, and the co-ordinates > of that region are sent to the server in the RPC call. > > If the server finds the reply will indeed be larger than the inline > maximum, it plants the reply in the client memory region described by > the request’s reply list, and repeats the co-ordinates of that region > back to the client in the RPC reply. > > A server may also choose to send a small reply inline, even if the > client provided a reply list. In that case, the server does not > repeat the reply list in the reply, and the full reply appears > inline. > > Linux registers part of the RPC reply buffer for the reply list. After > it is received on the client, the reply payload is copied by the client > CPU to its final destination. > > Inline and reply list are the mechanisms used when the upper layer > has some processing to do to the incoming data (eg READDIR). When > a request just needs raw data to be simply dropped off in the client’s > memory, then the write list is preferred. A write list is basically a > zero-copy I/O. The term "reply list" doesn't appear in either RFC. I believe you mean "client-posted write list" in most of the above, except this last paragraph, which should have started with "Inline and server-posted read list..." ? > But these choices are fixed by the specified RPC/RDMA binding of the > upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK > are the only NFS operations allowed to use a write list. (NFSv4 > compounds are somewhat ambiguous, and that too needs to be addressed). > > As READ_PLUS conveys both kinds of data (zero-copy and data that > might require some processing) IMO RFC 5667 does not provide adequate > guidance about how to convey READ_PLUS. It will need to be added > somewhere. OK, good. I wonder how it would do this. The best the client could do, I guess, is provide the same write list it would for a READ of the same extent. Could the server then write just the pieces of that extent it needs to, send the hole information inline, and leave it to the client to do any necessary zeroing? (And is any of this worth it?) > >>> NFSv4.2 currently does not have a binding to RPC/RDMA. > >> > >> Right, this means a spec update is needed. I agree with you, and > >> it’s on our list. > > > > OK, so that would go in some kind of update to 5667 rather than in the > > minor version 2 spec? > > The WG has to decide whether an update to 5667 or a new document will > be the ultimate vehicle. > > > Discussing this in the READ_PLUS description would also seem helpful to > > me, but OK I don't really have a strong opinion. > > If there is a precedent, it’s probably that the RPC/RDMA binding is > specified in a separate document. I suspect there won’t be much > appetite for holding up NFSv4.2 for an RPC/RDMA binding. Alright, I guess that makes sense, thanks. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Feb 6, 2015, at 2:35 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Fri, Feb 06, 2015 at 01:44:15PM -0500, Chuck Lever wrote: >> >> On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@fieldses.org> wrote: >> >>> On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote: >>>> >>>> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote: >>>> >>>>> >>>>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: >>>>> >>>>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: >>>>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >>>>>>>>> The problem is that the typical case of all data won't use splice >>>>>>>>> every with your patches as the 4.2 client will always send a READ_PLUS. >>>>>>>>> >>>>>>>>> So we'll have to find a way to use it where it helps. While we might be >>>>>>>>> able to add some hacks to only use splice for the first segment I guess >>>>>>>>> we just need to make the splice support generic enough in the long run. >>>>>>>>> >>>>>>>> >>>>>>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. >>>>>>> >>>>>>> You could also elect to never return more than one data segment as a >>>>>>> start: >>>>>>> >>>>>>> In all situations, the >>>>>>> server may choose to return fewer bytes than specified by the client. >>>>>>> The client needs to check for this condition and handle the >>>>>>> condition appropriately. >>>>>> >>>>>> Yeah, I think that was more-or-less what Anna's first attempt did and I >>>>>> said "what if that means more round trips"? The client can't anticipate >>>>>> the short reads so it can't make up for this with parallelism. >>>>>> >>>>>>> But doing any of these for a call that's really just an optimization >>>>>>> soudns odd. I'd really like to see an evaluation of the READ_PLUS >>>>>>> impact on various workloads before offering it. >>>>>> >>>>>> Yes, unfortunately I don't see a way to make this just an obvious win. >>>>> >>>>> I don’t think a “win” is necessary. It simply needs to be no worse than >>>>> READ for current use cases. >>>>> >>>>> READ_PLUS should be a win for the particular use cases it was >>>>> designed for (large sparsely-populated datasets). Without a >>>>> demonstrated benefit I think there’s no point in keeping it. >>>>> >>>>>> (Is there any way we could make it so with better protocol? Maybe RDMA >>>>>> could help get the alignment right in multiple-segment cases? But then >>>>>> I think there needs to be some sort of language about RDMA, or else >>>>>> we're stuck with: >>>>>> >>>>>> https://tools.ietf.org/html/rfc5667#section-5 >>>>>> >>>>>> which I think forces us to return READ_PLUS data inline, another >>>>>> possible READ_PLUS regression.) >>>> >>>> Btw, if I understand this correctly: >>>> >>>> Without a spec update, a large NFS READ_PLUS reply would be returned >>>> in a reply list, which is moved via RDMA WRITE, just like READ >>>> replies. >>>> >>>> The difference is NFS READ payload is placed directly into the >>>> client’s page cache by the adapter. With a reply list, the client >>>> transport would need to copy the returned data into the page cache. >>>> And a large reply buffer would be needed. >>>> >>>> So, slower, yes. But not inline. 
>>> >>> I'm not very good at this, bear with me, but: the above-referenced >>> section doesn't talk about "reply lists", only "write lists", and only >>> explains how to use write lists for READ and READLINK data, and seems to expect everything else to be sent inline. >> >> I may have some details wrong, but this is my understanding. >> >> Small replies are sent inline. There is a size maximum for inline >> messages, however. I guess 5667 section 5 assumes this context, which >> appears throughout RFC 5666. >> >> If an expected reply exceeds the inline size, then a client will >> set up a reply list for the server. A memory region on the client is >> registered as a target for RDMA WRITE operations, and the co-ordinates >> of that region are sent to the server in the RPC call. >> >> If the server finds the reply will indeed be larger than the inline >> maximum, it plants the reply in the client memory region described by >> the request’s reply list, and repeats the co-ordinates of that region >> back to the client in the RPC reply. >> >> A server may also choose to send a small reply inline, even if the >> client provided a reply list. In that case, the server does not >> repeat the reply list in the reply, and the full reply appears >> inline. >> >> Linux registers part of the RPC reply buffer for the reply list. After >> it is received on the client, the reply payload is copied by the client >> CPU to its final destination. >> >> Inline and reply list are the mechanisms used when the upper layer >> has some processing to do to the incoming data (eg READDIR). When >> a request just needs raw data to be simply dropped off in the client’s >> memory, then the write list is preferred. A write list is basically a >> zero-copy I/O. > > The term "reply list" doesn't appear in either RFC. I believe you mean > "client-posted write list" in most of the above, except this last > paragraph, which should have started with "Inline and server-posted read list...” ? No, I meant “reply list.” Definitely not read list. The terms used in the RFCs and the implementations vary, unfortunately, and only the read list is an actual list. The write and reply lists are actually two separate counted arrays that are both expressed using xdr_write_list. Have a look at RFC 5666, section 5.2, where it is referred to as either a “long reply” or a “reply chunk.” >> But these choices are fixed by the specified RPC/RDMA binding of the >> upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK >> are the only NFS operations allowed to use a write list. (NFSv4 >> compounds are somewhat ambiguous, and that too needs to be addressed). >> >> As READ_PLUS conveys both kinds of data (zero-copy and data that >> might require some processing) IMO RFC 5667 does not provide adequate >> guidance about how to convey READ_PLUS. It will need to be added >> somewhere. > > OK, good. I wonder how it would do this. The best the client could do, > I guess, is provide the same write list it would for a READ of the same > extent. Could the server then write just the pieces of that extent it > needs to, send the hole information inline, and leave it to the client > to do any necessary zeroing? (And is any of this worth it?) Conveying large data payloads using zero-copy techniques should be beneficial. Since hole information could appear in a reply list if it were large, and thus would not be inline, technically speaking, the best we can say is that hole information wouldn’t be eligible for the write list. 
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Fri, Feb 06, 2015 at 03:07:08PM -0500, Chuck Lever wrote: > > On Feb 6, 2015, at 2:35 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > On Fri, Feb 06, 2015 at 01:44:15PM -0500, Chuck Lever wrote: > >> > >> On Feb 6, 2015, at 12:59 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > >> > >>> On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote: > >>>> > >>>> On Feb 6, 2015, at 11:46 AM, Chuck Lever <chuck.lever@oracle.com> wrote: > >>>> > >>>>> > >>>>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > >>>>> > >>>>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote: > >>>>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: > >>>>>>>>> The problem is that the typical case of all data won't use splice > >>>>>>>>> every with your patches as the 4.2 client will always send a READ_PLUS. > >>>>>>>>> > >>>>>>>>> So we'll have to find a way to use it where it helps. While we might be > >>>>>>>>> able to add some hacks to only use splice for the first segment I guess > >>>>>>>>> we just need to make the splice support generic enough in the long run. > >>>>>>>>> > >>>>>>>> > >>>>>>>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. > >>>>>>> > >>>>>>> You could also elect to never return more than one data segment as a > >>>>>>> start: > >>>>>>> > >>>>>>> In all situations, the > >>>>>>> server may choose to return fewer bytes than specified by the client. > >>>>>>> The client needs to check for this condition and handle the > >>>>>>> condition appropriately. > >>>>>> > >>>>>> Yeah, I think that was more-or-less what Anna's first attempt did and I > >>>>>> said "what if that means more round trips"? The client can't anticipate > >>>>>> the short reads so it can't make up for this with parallelism. > >>>>>> > >>>>>>> But doing any of these for a call that's really just an optimization > >>>>>>> soudns odd. I'd really like to see an evaluation of the READ_PLUS > >>>>>>> impact on various workloads before offering it. > >>>>>> > >>>>>> Yes, unfortunately I don't see a way to make this just an obvious win. > >>>>> > >>>>> I don’t think a “win” is necessary. It simply needs to be no worse than > >>>>> READ for current use cases. > >>>>> > >>>>> READ_PLUS should be a win for the particular use cases it was > >>>>> designed for (large sparsely-populated datasets). Without a > >>>>> demonstrated benefit I think there’s no point in keeping it. > >>>>> > >>>>>> (Is there any way we could make it so with better protocol? Maybe RDMA > >>>>>> could help get the alignment right in multiple-segment cases? But then > >>>>>> I think there needs to be some sort of language about RDMA, or else > >>>>>> we're stuck with: > >>>>>> > >>>>>> https://tools.ietf.org/html/rfc5667#section-5 > >>>>>> > >>>>>> which I think forces us to return READ_PLUS data inline, another > >>>>>> possible READ_PLUS regression.) > >>>> > >>>> Btw, if I understand this correctly: > >>>> > >>>> Without a spec update, a large NFS READ_PLUS reply would be returned > >>>> in a reply list, which is moved via RDMA WRITE, just like READ > >>>> replies. > >>>> > >>>> The difference is NFS READ payload is placed directly into the > >>>> client’s page cache by the adapter. With a reply list, the client > >>>> transport would need to copy the returned data into the page cache. > >>>> And a large reply buffer would be needed. > >>>> > >>>> So, slower, yes. But not inline. 
> >>> > >>> I'm not very good at this, bear with me, but: the above-referenced > >>> section doesn't talk about "reply lists", only "write lists", and only > >>> explains how to use write lists for READ and READLINK data, and seems to expect everything else to be sent inline. > >> > >> I may have some details wrong, but this is my understanding. > >> > >> Small replies are sent inline. There is a size maximum for inline > >> messages, however. I guess 5667 section 5 assumes this context, which > >> appears throughout RFC 5666. > >> > >> If an expected reply exceeds the inline size, then a client will > >> set up a reply list for the server. A memory region on the client is > >> registered as a target for RDMA WRITE operations, and the co-ordinates > >> of that region are sent to the server in the RPC call. > >> > >> If the server finds the reply will indeed be larger than the inline > >> maximum, it plants the reply in the client memory region described by > >> the request’s reply list, and repeats the co-ordinates of that region > >> back to the client in the RPC reply. > >> > >> A server may also choose to send a small reply inline, even if the > >> client provided a reply list. In that case, the server does not > >> repeat the reply list in the reply, and the full reply appears > >> inline. > >> > >> Linux registers part of the RPC reply buffer for the reply list. After > >> it is received on the client, the reply payload is copied by the client > >> CPU to its final destination. > >> > >> Inline and reply list are the mechanisms used when the upper layer > >> has some processing to do to the incoming data (eg READDIR). When > >> a request just needs raw data to be simply dropped off in the client’s > >> memory, then the write list is preferred. A write list is basically a > >> zero-copy I/O. > > > > The term "reply list" doesn't appear in either RFC. I believe you mean > > "client-posted write list" in most of the above, except this last > > paragraph, which should have started with "Inline and server-posted read list...” ? > > No, I meant “reply list.” Definitely not read list. > > The terms used in the RFCs and the implementations vary, OK. Would you mind defining the term "reply list" for me? Google's not helping. --b. > unfortunately, and only the read list is an actual list. The write and > reply lists are actually two separate counted arrays that are both > expressed using xdr_write_list. > > Have a look at RFC 5666, section 5.2, where it is referred to as > either a “long reply” or a “reply chunk.” > > >> But these choices are fixed by the specified RPC/RDMA binding of the > >> upper layer protocol (that’s what RFC 5667 is). NFS READ and READLINK > >> are the only NFS operations allowed to use a write list. (NFSv4 > >> compounds are somewhat ambiguous, and that too needs to be addressed). > >> > >> As READ_PLUS conveys both kinds of data (zero-copy and data that > >> might require some processing) IMO RFC 5667 does not provide adequate > >> guidance about how to convey READ_PLUS. It will need to be added > >> somewhere. > > > > OK, good. I wonder how it would do this. The best the client could do, > > I guess, is provide the same write list it would for a READ of the same > > extent. Could the server then write just the pieces of that extent it > > needs to, send the hole information inline, and leave it to the client > > to do any necessary zeroing? (And is any of this worth it?) > > Conveying large data payloads using zero-copy techniques should be > beneficial. 
>
> Since hole information could appear in a reply list if it were large,
> and thus would not be inline, technically speaking, the best we can
> say is that hole information wouldn’t be eligible for the write list.
On Feb 6, 2015, at 3:28 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Fri, Feb 06, 2015 at 03:07:08PM -0500, Chuck Lever wrote: >> >> On Feb 6, 2015, at 2:35 PM, J. Bruce Fields <bfields@fieldses.org> wrote: >>>> >>>> Small replies are sent inline. There is a size maximum for inline >>>> messages, however. I guess 5667 section 5 assumes this context, which >>>> appears throughout RFC 5666. >>>> >>>> If an expected reply exceeds the inline size, then a client will >>>> set up a reply list for the server. A memory region on the client is >>>> registered as a target for RDMA WRITE operations, and the co-ordinates >>>> of that region are sent to the server in the RPC call. >>>> >>>> If the server finds the reply will indeed be larger than the inline >>>> maximum, it plants the reply in the client memory region described by >>>> the request’s reply list, and repeats the co-ordinates of that region >>>> back to the client in the RPC reply. >>>> >>>> A server may also choose to send a small reply inline, even if the >>>> client provided a reply list. In that case, the server does not >>>> repeat the reply list in the reply, and the full reply appears >>>> inline. >>>> >>>> Linux registers part of the RPC reply buffer for the reply list. After >>>> it is received on the client, the reply payload is copied by the client >>>> CPU to its final destination. >>>> >>>> Inline and reply list are the mechanisms used when the upper layer >>>> has some processing to do to the incoming data (eg READDIR). When >>>> a request just needs raw data to be simply dropped off in the client’s >>>> memory, then the write list is preferred. A write list is basically a >>>> zero-copy I/O. >>> >>> The term "reply list" doesn't appear in either RFC. I believe you mean >>> "client-posted write list" in most of the above, except this last >>> paragraph, which should have started with "Inline and server-posted read list...” ? >> >> No, I meant “reply list.” Definitely not read list. >> >> The terms used in the RFCs and the implementations vary, > > OK. Would you mind defining the term "reply list" for me? Google's not helping. Let’s look at section 4.3 of RFC 5666. Each RPC/RDMA header begins with this: struct rdma_msg { uint32 rdma_xid; /* Mirrors the RPC header xid */ uint32 rdma_vers; /* Version of this protocol */ uint32 rdma_credit; /* Buffers requested/granted */ rdma_body rdma_body; }; rdma_body starts with a uint32 which discriminates a union: union rdma_body switch (rdma_proc proc) { . . . case RDMA_NOMSG: rpc_rdma_header_nomsg rdma_nomsg; . . . }; When “proc” == RDMA_NOMSG, rdma_body is made up of three lists: struct rpc_rdma_header_nomsg { struct xdr_read_list *rdma_reads; struct xdr_write_list *rdma_writes; struct xdr_write_chunk *rdma_reply; }; The “reply list” is that last part: rdma_reply, which is a counted array of xdr_rdma_segment’s. Large replies for non-NFS READ operations are sent using RDMA_NOMSG. The RPC/RDMA header is sent as the inline portion of the message. The RPC reply message (the part we are all familiar with) is planted in the memory region described by rdma_reply, it’s not inline. rdma_reply is a write chunk. The server WRITEs its RPC reply into the memory region described by rdma_reply. That description was provided by the client in the matching RPC call message. 
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On 02/05/2015 11:48 AM, J. Bruce Fields wrote: > On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >> On 02/05/2015 11:23 AM, Christoph Hellwig wrote: >>> On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote: >>>>> If the READ_PLUS implementation can't use the splice read path it's >>>>> probably a loss for most workloads. >>>>> >>>> >>>> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments. >>> >>> The problem is that the typical case of all data won't use splice >>> every with your patches as the 4.2 client will always send a READ_PLUS. >>> >>> So we'll have to find a way to use it where it helps. While we might be >>> able to add some hacks to only use splice for the first segment I guess >>> we just need to make the splice support generic enough in the long run. >>> >> >> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. > > Oh, good thought, yes maybe that would be enough. > > But I still wish he had more evidence here. I switched to using splice and fixed a few other bugs that happened to pop up. xfstests definitely runs faster compared to the old code, but I haven't had a chance to compare against other NFS versions yet. Anna > > --b. > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Feb 06, 2015 at 04:12:51PM -0500, Chuck Lever wrote: > > On Feb 6, 2015, at 3:28 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > On Fri, Feb 06, 2015 at 03:07:08PM -0500, Chuck Lever wrote: > >> > >> On Feb 6, 2015, at 2:35 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > >>>> > >>>> Small replies are sent inline. There is a size maximum for inline > >>>> messages, however. I guess 5667 section 5 assumes this context, which > >>>> appears throughout RFC 5666. > >>>> > >>>> If an expected reply exceeds the inline size, then a client will > >>>> set up a reply list for the server. A memory region on the client is > >>>> registered as a target for RDMA WRITE operations, and the co-ordinates > >>>> of that region are sent to the server in the RPC call. > >>>> > >>>> If the server finds the reply will indeed be larger than the inline > >>>> maximum, it plants the reply in the client memory region described by > >>>> the request’s reply list, and repeats the co-ordinates of that region > >>>> back to the client in the RPC reply. > >>>> > >>>> A server may also choose to send a small reply inline, even if the > >>>> client provided a reply list. In that case, the server does not > >>>> repeat the reply list in the reply, and the full reply appears > >>>> inline. > >>>> > >>>> Linux registers part of the RPC reply buffer for the reply list. After > >>>> it is received on the client, the reply payload is copied by the client > >>>> CPU to its final destination. > >>>> > >>>> Inline and reply list are the mechanisms used when the upper layer > >>>> has some processing to do to the incoming data (eg READDIR). When > >>>> a request just needs raw data to be simply dropped off in the client’s > >>>> memory, then the write list is preferred. A write list is basically a > >>>> zero-copy I/O. > >>> > >>> The term "reply list" doesn't appear in either RFC. I believe you mean > >>> "client-posted write list" in most of the above, except this last > >>> paragraph, which should have started with "Inline and server-posted read list...” ? > >> > >> No, I meant “reply list.” Definitely not read list. > >> > >> The terms used in the RFCs and the implementations vary, > > > > OK. Would you mind defining the term "reply list" for me? Google's not helping. > > Let’s look at section 4.3 of RFC 5666. Each RPC/RDMA header begins > with this: > > struct rdma_msg { > uint32 rdma_xid; /* Mirrors the RPC header xid */ > uint32 rdma_vers; /* Version of this protocol */ > uint32 rdma_credit; /* Buffers requested/granted */ > rdma_body rdma_body; > }; > > rdma_body starts with a uint32 which discriminates a union: > > union rdma_body switch (rdma_proc proc) { > . . . > case RDMA_NOMSG: > rpc_rdma_header_nomsg rdma_nomsg; > . . . > }; > > When “proc” == RDMA_NOMSG, rdma_body is made up of three lists: > > struct rpc_rdma_header_nomsg { > struct xdr_read_list *rdma_reads; > struct xdr_write_list *rdma_writes; > struct xdr_write_chunk *rdma_reply; > }; > > The “reply list” is that last part: rdma_reply, which is a counted > array of xdr_rdma_segment’s. > > Large replies for non-NFS READ operations are sent using RDMA_NOMSG. > The RPC/RDMA header is sent as the inline portion of the message. > The RPC reply message (the part we are all familiar with) is planted > in the memory region described by rdma_reply, it’s not inline. > > rdma_reply is a write chunk. The server WRITEs its RPC reply into the > memory region described by rdma_reply. 
That description was provided > by the client in the matching RPC call message. Thanks! Gah, my apologies, obviously I didn't understand the reference to section 5.2 before. I think I understand now.... And I'll be interested to see what we come up with for READ_PLUS case. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/05/2015 11:48 AM, J. Bruce Fields wrote: > On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >> On 02/05/2015 11:23 AM, Christoph Hellwig wrote: >>> On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote: >>>>> If the READ_PLUS implementation can't use the splice read path it's >>>>> probably a loss for most workloads. >>>>> >>>> >>>> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments. >>> >>> The problem is that the typical case of all data won't use splice >>> every with your patches as the 4.2 client will always send a READ_PLUS. >>> >>> So we'll have to find a way to use it where it helps. While we might be >>> able to add some hacks to only use splice for the first segment I guess >>> we just need to make the splice support generic enough in the long run. >>> >> >> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. > > Oh, good thought, yes maybe that would be enough. > > But I still wish he had more evidence here. I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. Anna > > --b. > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote: > On 02/05/2015 11:48 AM, J. Bruce Fields wrote: >> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote: >>> On 02/05/2015 11:23 AM, Christoph Hellwig wrote: >>>> On Thu, Feb 05, 2015 at 11:06:04AM -0500, Anna Schumaker wrote: >>>>>> If the READ_PLUS implementation can't use the splice read path it's >>>>>> probably a loss for most workloads. >>>>>> >>>>> >>>>> I'll play around with the splice path, but I don't think it was designed for encoding multiple read segments. >>>> >>>> The problem is that the typical case of all data won't use splice >>>> every with your patches as the 4.2 client will always send a READ_PLUS. >>>> >>>> So we'll have to find a way to use it where it helps. While we might be >>>> able to add some hacks to only use splice for the first segment I guess >>>> we just need to make the splice support generic enough in the long run. >>>> >>> >>> I should be able to use splice if I detect that we're only returning a single DATA segment easily enough. >> >> Oh, good thought, yes maybe that would be enough. >> >> But I still wish he had more evidence here. > > I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. > I'm wondering if the right way to do READ_PLUS would have been to instead have a separate function READ_SPARSE, that will return a list of all sparse areas in the supplied range. We could even make that a READ_SAME, that can do the same for patterned data. The thing is that READ works just fine for what we want it to do. The real win here would be if given a very large file, we could request a list of all the sparse areas in, say, a 100GB range, and then use that data to build up a bitmap of unallocated blocks for which we can skip the READ requests.
On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: > On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker > <Anna.Schumaker@netapp.com> wrote: > > I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. > > > > I'm wondering if the right way to do READ_PLUS would have been to > instead have a separate function READ_SPARSE, that will return a list > of all sparse areas in the supplied range. We could even make that a > READ_SAME, that can do the same for patterned data. I worry about ending up with incoherent results, but perhaps it's no different from the current behavior since we're already piecing together our idea of the file content from multiple reads sent in parallel. > The thing is that READ works just fine for what we want it to do. The > real win here would be if given a very large file, we could request a > list of all the sparse areas in, say, a 100GB range, and then use that > data to build up a bitmap of unallocated blocks for which we can skip > the READ requests. Can we start by having the server return a single data extent covering the whole read request, with the single exception of the case where the read falls entirely within a hole? I think that should help in the case of large holes without interfering with the client's zero-copy logic in the case there are no large holes. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 11, 2015 at 11:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: >> On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker >> <Anna.Schumaker@netapp.com> wrote: >> > I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. >> > >> >> I'm wondering if the right way to do READ_PLUS would have been to >> instead have a separate function READ_SPARSE, that will return a list >> of all sparse areas in the supplied range. We could even make that a >> READ_SAME, that can do the same for patterned data. > > I worry about ending up with incoherent results, but perhaps it's no > different from the current behavior since we're already piecing together > our idea of the file content from multiple reads sent in parallel. I don't see what the problem is. The client sends a READ_SPARSE, and caches the existence or not of a hole. How is that in any way different from caching the results of a read that returns no data? >> The thing is that READ works just fine for what we want it to do. The >> real win here would be if given a very large file, we could request a >> list of all the sparse areas in, say, a 100GB range, and then use that >> data to build up a bitmap of unallocated blocks for which we can skip >> the READ requests. > > Can we start by having the server return a single data extent covering > the whole read request, with the single exception of the case where the > read falls entirely within a hole? > > I think that should help in the case of large holes without interfering > with the client's zero-copy logic in the case there are no large holes. > That still forces the server to do extra work on each read: it has to check for the presence of a hole or not instead of just filling the buffer with data.
On Wed, Feb 11, 2015 at 11:31:43AM -0500, Trond Myklebust wrote: > On Wed, Feb 11, 2015 at 11:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > > On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: > >> On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker > >> <Anna.Schumaker@netapp.com> wrote: > >> > I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. > >> > > >> > >> I'm wondering if the right way to do READ_PLUS would have been to > >> instead have a separate function READ_SPARSE, that will return a list > >> of all sparse areas in the supplied range. We could even make that a > >> READ_SAME, that can do the same for patterned data. > > > > I worry about ending up with incoherent results, but perhaps it's no > > different from the current behavior since we're already piecing together > > our idea of the file content from multiple reads sent in parallel. > > I don't see what the problem is. The client sends a READ_SPARSE, and > caches the existence or not of a hole. How is that in any way > different from caching the results of a read that returns no data? > > >> The thing is that READ works just fine for what we want it to do. The > >> real win here would be if given a very large file, we could request a > >> list of all the sparse areas in, say, a 100GB range, and then use that > >> data to build up a bitmap of unallocated blocks for which we can skip > >> the READ requests. > > > > Can we start by having the server return a single data extent covering > > the whole read request, with the single exception of the case where the > > read falls entirely within a hole? > > > > I think that should help in the case of large holes without interfering > > with the client's zero-copy logic in the case there are no large holes. > > > > That still forces the server to do extra work on each read: it has to > check for the presence of a hole or not instead of just filling the > buffer with data. I guess. Presumably the filesystem already does something like that on read, so I find it hard to imagine that extra check is expensive, especially if it's amortized over large-ish reads. Seems worth checking at least. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 11, 2015 at 12:39 PM, Marc Eshel <eshel@us.ibm.com> wrote: > linux-nfs-owner@vger.kernel.org wrote on 02/11/2015 08:31:43 AM: > >> From: Trond Myklebust <trond.myklebust@primarydata.com> >> To: "J. Bruce Fields" <bfields@fieldses.org> >> Cc: Anna Schumaker <Anna.Schumaker@netapp.com>, Christoph Hellwig >> <hch@infradead.org>, Linux NFS Mailing List <linux- >> nfs@vger.kernel.org>, Thomas D Haynes <thomas.haynes@primarydata.com> >> Date: 02/11/2015 08:31 AM >> Subject: Re: [PATCH v2 2/4] NFSD: Add READ_PLUS support for data segments >> Sent by: linux-nfs-owner@vger.kernel.org >> >> On Wed, Feb 11, 2015 at 11:22 AM, J. Bruce Fields >> <bfields@fieldses.org> wrote: >> > On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: >> >> On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker >> >> <Anna.Schumaker@netapp.com> wrote: >> >> > I'm not seeing a huge performance increase with READ_PLUS >> compared to READ (in fact, it's often a bit slower compared to READ, >> even when using splice). My guess is that the problem is mostly on >> the client's end since we have to do a memory shift on each segment >> to get everything lined up properly. I'm playing around with code >> that cuts down the number of memory shifts, but I still have a few >> bugs to work out before I'll know if it actually helps. >> >> > >> >> >> >> I'm wondering if the right way to do READ_PLUS would have been to >> >> instead have a separate function READ_SPARSE, that will return a list >> >> of all sparse areas in the supplied range. We could even make that a >> >> READ_SAME, that can do the same for patterned data. >> > >> > I worry about ending up with incoherent results, but perhaps it's no >> > different from the current behavior since we're already piecing together >> > our idea of the file content from multiple reads sent in parallel. >> >> I don't see what the problem is. The client sends a READ_SPARSE, and >> caches the existence or not of a hole. How is that in any way >> different from caching the results of a read that returns no data? >> >> >> The thing is that READ works just fine for what we want it to do. The >> >> real win here would be if given a very large file, we could request a >> >> list of all the sparse areas in, say, a 100GB range, and then use that >> >> data to build up a bitmap of unallocated blocks for which we can skip >> >> the READ requests. >> > >> > Can we start by having the server return a single data extent covering >> > the whole read request, with the single exception of the case where the >> > read falls entirely within a hole? >> > >> > I think that should help in the case of large holes without interfering >> > with the client's zero-copy logic in the case there are no large holes. >> > >> >> That still forces the server to do extra work on each read: it has to >> check for the presence of a hole or not instead of just filling the >> buffer with data. > > A good hint that we are dealing with a sparse file is the the number of > blocks don't add up to the reported filesize > Marc. Sure, but that still adds up to an unnecessary inefficiency on each READ_PLUS call to that file. My point is that the best way for the client to process this information (and for the server too) is to read the sparseness map in once as a bulk operation on a very large chunk of the file, and then to use that map as a guide for when it needs to call READ.
On Wed, Feb 11, 2015 at 12:47:26PM -0500, Trond Myklebust wrote: > On Wed, Feb 11, 2015 at 12:39 PM, Marc Eshel <eshel@us.ibm.com> wrote: > > > > A good hint that we are dealing with a sparse file is the the number of > > blocks don't add up to the reported filesize > > Sure, but that still adds up to an unnecessary inefficiency on each > READ_PLUS call to that file. > > My point is that the best way for the client to process this > information (and for the server too) is to read the sparseness map in > once as a bulk operation on a very large chunk of the file, and then > to use that map as a guide for when it needs to call READ. I thought we agreed that the sparseness map made no sense because the information was immeadiately stale? Anyway, the client could build that map if it wanted to using SEEK. I'm not arguing that it would be efficient, but that it could be done with a cycle of looking for holes. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 11, 2015 at 1:17 PM, Tom Haynes <thomas.haynes@primarydata.com> wrote: > On Wed, Feb 11, 2015 at 12:47:26PM -0500, Trond Myklebust wrote: >> On Wed, Feb 11, 2015 at 12:39 PM, Marc Eshel <eshel@us.ibm.com> wrote: >> > >> > A good hint that we are dealing with a sparse file is the the number of >> > blocks don't add up to the reported filesize >> >> Sure, but that still adds up to an unnecessary inefficiency on each >> READ_PLUS call to that file. >> >> My point is that the best way for the client to process this >> information (and for the server too) is to read the sparseness map in >> once as a bulk operation on a very large chunk of the file, and then >> to use that map as a guide for when it needs to call READ. > > I thought we agreed that the sparseness map made no sense because > the information was immeadiately stale? Right now, we're caching the zero reads, aren't we? What makes caching hole information so special? > Anyway, the client could build that map if it wanted to using SEEK. I'm > not arguing that it would be efficient, but that it could be done > with a cycle of looking for holes. Sure, however that's not necessarily useful as an optimiser either. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/11/2015 01:49 PM, Trond Myklebust wrote: > On Wed, Feb 11, 2015 at 1:17 PM, Tom Haynes > <thomas.haynes@primarydata.com> wrote: >> On Wed, Feb 11, 2015 at 12:47:26PM -0500, Trond Myklebust wrote: >>> On Wed, Feb 11, 2015 at 12:39 PM, Marc Eshel <eshel@us.ibm.com> wrote: >>>> >>>> A good hint that we are dealing with a sparse file is the the number of >>>> blocks don't add up to the reported filesize >>> >>> Sure, but that still adds up to an unnecessary inefficiency on each >>> READ_PLUS call to that file. >>> >>> My point is that the best way for the client to process this >>> information (and for the server too) is to read the sparseness map in >>> once as a bulk operation on a very large chunk of the file, and then >>> to use that map as a guide for when it needs to call READ. >> >> I thought we agreed that the sparseness map made no sense because >> the information was immeadiately stale? > > Right now, we're caching the zero reads, aren't we? What makes caching > hole information so special? > >> Anyway, the client could build that map if it wanted to using SEEK. I'm >> not arguing that it would be efficient, but that it could be done >> with a cycle of looking for holes. > > Sure, however that's not necessarily useful as an optimiser either. The vfs on Linux doesn't keep a sparse map either, so the server would be stuck doing seeks to build this up anyway. I've already seen that this can be somewhat unreliable while working on READ_PLUS. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/11/2015 11:22 AM, J. Bruce Fields wrote: > On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: >> On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker >> <Anna.Schumaker@netapp.com> wrote: >>> I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. >>> >> >> I'm wondering if the right way to do READ_PLUS would have been to >> instead have a separate function READ_SPARSE, that will return a list >> of all sparse areas in the supplied range. We could even make that a >> READ_SAME, that can do the same for patterned data. > > I worry about ending up with incoherent results, but perhaps it's no > different from the current behavior since we're already piecing together > our idea of the file content from multiple reads sent in parallel. > >> The thing is that READ works just fine for what we want it to do. The >> real win here would be if given a very large file, we could request a >> list of all the sparse areas in, say, a 100GB range, and then use that >> data to build up a bitmap of unallocated blocks for which we can skip >> the READ requests. > > Can we start by having the server return a single data extent covering > the whole read request, with the single exception of the case where the > read falls entirely within a hole? I'm trying this and it's still giving me pretty bad performance. I picked out 6 xfstests that read sparse files, and v4.2 takes almost a minute longer to run compared to v4.1. (1:30 vs 2:22). I'm going to look into how I zero pages on the client - maybe that can be combined with the right-shift function so pages only need to be mapped into memory once. Anna > > I think that should help in the case of large holes without interfering > with the client's zero-copy logic in the case there are no large holes. > > --b. > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: > I'm wondering if the right way to do READ_PLUS would have been to > instead have a separate function READ_SPARSE, that will return a list > of all sparse areas in the supplied range. We could even make that a > READ_SAME, that can do the same for patterned data. That soudns like the protocol level equivalent of the fiemap ioctl. While fiemap is useful for filesystem debugging, using it for detection of sparse ranges in oreutils turned out to be a desaster, and we instead oved to support lseek SEEK_DATA/SEEK_HOLE, which would map to SEEK in NFSv4.2. The userspace tools use the different in file size vs allocate blocks in the stat data to decide if they use SEEK_DATA/SEEK_HOLE, which should work just fine over NFSv4.2 as well. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 11, 2015 at 11:42:19AM -0500, J. Bruce Fields wrote: > I guess. Presumably the filesystem already does something like that on > read, so I find it hard to imagine that extra check is expensive, > especially if it's amortized over large-ish reads. Seems worth checking > at least. The filesystem zero-fills blocks when reading them from disk, which is serveral layers away from where it interacts with nfsd. We've been looking at an extension to read to return AGAIN when data can't be read at the momemnt, it would be logical to base on that and just return a hole size if there isn't any data, but implementing that would be a fair amount of work. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/11/2015 02:22 PM, Anna Schumaker wrote: > On 02/11/2015 11:22 AM, J. Bruce Fields wrote: >> On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: >>> On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker >>> <Anna.Schumaker@netapp.com> wrote: >>>> I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. >>>> >>> >>> I'm wondering if the right way to do READ_PLUS would have been to >>> instead have a separate function READ_SPARSE, that will return a list >>> of all sparse areas in the supplied range. We could even make that a >>> READ_SAME, that can do the same for patterned data. >> >> I worry about ending up with incoherent results, but perhaps it's no >> different from the current behavior since we're already piecing together >> our idea of the file content from multiple reads sent in parallel. >> >>> The thing is that READ works just fine for what we want it to do. The >>> real win here would be if given a very large file, we could request a >>> list of all the sparse areas in, say, a 100GB range, and then use that >>> data to build up a bitmap of unallocated blocks for which we can skip >>> the READ requests. >> >> Can we start by having the server return a single data extent covering >> the whole read request, with the single exception of the case where the >> read falls entirely within a hole? > > I'm trying this and it's still giving me pretty bad performance. I picked out 6 xfstests that read sparse files, and v4.2 takes almost a minute longer to run compared to v4.1. (1:30 vs 2:22). > > I'm going to look into how I zero pages on the client - maybe that can be combined with the right-shift function so pages only need to be mapped into memory once. Today I learned all about how to use operf to identify where the bottleneck is :). It looks like the problem is in the hole zeroing code on the client side. Is there a better way than memset() to mark a page as all zeros? Anna > > Anna > >> >> I think that should help in the case of large holes without interfering >> with the client's zero-copy logic in the case there are no large holes. >> >> --b. >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 12, 2015 at 02:59:45PM -0500, Anna Schumaker wrote:
> Today I learned all about how to use operf to identify where the bottleneck is :). It looks like the problem is in the hole zeroing code on the client side. Is there a better way than memset() to mark a page as all zeros?
clear_highpage(), but I don't expect it to make a huge different.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 12, 2015 at 02:59:45PM -0500, Anna Schumaker wrote: > On 02/11/2015 02:22 PM, Anna Schumaker wrote: > > On 02/11/2015 11:22 AM, J. Bruce Fields wrote: > >> On Wed, Feb 11, 2015 at 11:13:38AM -0500, Trond Myklebust wrote: > >>> On Wed, Feb 11, 2015 at 11:04 AM, Anna Schumaker > >>> <Anna.Schumaker@netapp.com> wrote: > >>>> I'm not seeing a huge performance increase with READ_PLUS compared to READ (in fact, it's often a bit slower compared to READ, even when using splice). My guess is that the problem is mostly on the client's end since we have to do a memory shift on each segment to get everything lined up properly. I'm playing around with code that cuts down the number of memory shifts, but I still have a few bugs to work out before I'll know if it actually helps. > >>>> > >>> > >>> I'm wondering if the right way to do READ_PLUS would have been to > >>> instead have a separate function READ_SPARSE, that will return a list > >>> of all sparse areas in the supplied range. We could even make that a > >>> READ_SAME, that can do the same for patterned data. > >> > >> I worry about ending up with incoherent results, but perhaps it's no > >> different from the current behavior since we're already piecing together > >> our idea of the file content from multiple reads sent in parallel. > >> > >>> The thing is that READ works just fine for what we want it to do. The > >>> real win here would be if given a very large file, we could request a > >>> list of all the sparse areas in, say, a 100GB range, and then use that > >>> data to build up a bitmap of unallocated blocks for which we can skip > >>> the READ requests. > >> > >> Can we start by having the server return a single data extent covering > >> the whole read request, with the single exception of the case where the > >> read falls entirely within a hole? > > > > I'm trying this and it's still giving me pretty bad performance. I picked out 6 xfstests that read sparse files, and v4.2 takes almost a minute longer to run compared to v4.1. (1:30 vs 2:22). > > > > I'm going to look into how I zero pages on the client - maybe that can be combined with the right-shift function so pages only need to be mapped into memory once. > > Today I learned all about how to use operf to identify where the bottleneck is :). It looks like the problem is in the hole zeroing code on the client side. Is there a better way than memset() to mark a page as all zeros? I'm a little surprised that replacing a READ with a READ_PLUS + memset is slower, and I'm really surprised that it's a minute slower. Are we sure there isn't something else going on? Does comparing the operation counts (from /proc/self/mountstats) show anything interesting? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index ac71d13..fdf9ce3 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -1603,6 +1603,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op) return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32); } +static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op) +{ + u32 maxcount = svc_max_payload(rqstp); + u32 rlen = min(op->u.read.rd_length, maxcount); + /* extra xdr space for encoding read plus segments. */ + u32 xdr = 4; + + return (op_encode_hdr_size + 2 + xdr + XDR_QUADLEN(rlen)) * sizeof(__be32); +} + static inline u32 nfsd4_readdir_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op) { u32 maxcount = 0, rlen = 0; @@ -1980,6 +1990,12 @@ static struct nfsd4_operation nfsd4_ops[] = { .op_name = "OP_DEALLOCATE", .op_rsize_bop = (nfsd4op_rsize)nfsd4_write_rsize, }, + [OP_READ_PLUS] = { + .op_func = (nfsd4op_func)nfsd4_read, + .op_name = "OP_READ_PLUS", + .op_rsize_bop = (nfsd4op_rsize)nfsd4_read_plus_rsize, + .op_get_currentstateid = (stateid_getter)nfsd4_get_readstateid, + }, [OP_SEEK] = { .op_func = (nfsd4op_func)nfsd4_seek, .op_name = "OP_SEEK", diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c index 01248b2..480b07f 100644 --- a/fs/nfsd/nfs4xdr.c +++ b/fs/nfsd/nfs4xdr.c @@ -1630,7 +1630,7 @@ static nfsd4_dec nfsd4_dec_ops[] = { [OP_LAYOUTSTATS] = (nfsd4_dec)nfsd4_decode_notsupp, [OP_OFFLOAD_CANCEL] = (nfsd4_dec)nfsd4_decode_notsupp, [OP_OFFLOAD_STATUS] = (nfsd4_dec)nfsd4_decode_notsupp, - [OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_notsupp, + [OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_read, [OP_SEEK] = (nfsd4_dec)nfsd4_decode_seek, [OP_WRITE_SAME] = (nfsd4_dec)nfsd4_decode_notsupp, }; @@ -3802,6 +3802,81 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres *resp, __be32 nfserr, } static __be32 +nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp, struct nfsd4_read *read, + struct file *file) +{ + __be32 *p, err; + unsigned long maxcount; + struct xdr_stream *xdr = &resp->xdr; + + p = xdr_reserve_space(xdr, 4 + 8 + 4); /* content type, offset, maxcount */ + if (!p) + return nfserr_resource; + xdr_commit_encode(xdr); + + maxcount = svc_max_payload(resp->rqstp); + maxcount = min_t(unsigned long, maxcount, (xdr->buf->buflen - xdr->buf->len)); + maxcount = min_t(unsigned long, maxcount, read->rd_length); + + err = nfsd4_encode_readv(resp, read, file, &maxcount); + + *p++ = cpu_to_be32(NFS4_CONTENT_DATA); + p = xdr_encode_hyper(p, read->rd_offset); + *p++ = cpu_to_be32(maxcount); + + read->rd_offset += maxcount; + return err; +} + +static __be32 +nfsd4_encode_read_plus(struct nfsd4_compoundres *resp, __be32 nfserr, + struct nfsd4_read *read) +{ + struct xdr_stream *xdr = &resp->xdr; + struct file *file = read->rd_filp; + int starting_len = xdr->buf->len; + struct raparms *ra; + __be32 *p; + __be32 err; + u32 eof, segments = 0; + + if (nfserr) + return nfserr; + + /* eof flag, segment count */ + p = xdr_reserve_space(xdr, 4 + 4 ); + if (!p) + return nfserr_resource; + xdr_commit_encode(xdr); + + if (!read->rd_filp) { + err = nfsd_get_tmp_read_open(resp->rqstp, read->rd_fhp, + &file, &ra); + if (err) + goto err_truncate; + } + + if (read->rd_offset >= i_size_read(file_inode(file))) + goto out_encode; + + err = nfsd4_encode_read_plus_data(resp, read, file); + segments++; + +out_encode: + eof = (read->rd_offset >= i_size_read(file_inode(file))); + *p++ = cpu_to_be32(eof); + *p++ = cpu_to_be32(segments); + + if (!read->rd_filp) + nfsd_put_tmp_read_open(file, ra); 
+ +err_truncate: + if (err) + xdr_truncate_encode(xdr, starting_len); + return err; +} + +static __be32 nfsd4_encode_seek(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_seek *seek) { @@ -3900,7 +3975,7 @@ static nfsd4_enc nfsd4_enc_ops[] = { [OP_LAYOUTSTATS] = (nfsd4_enc)nfsd4_encode_noop, [OP_OFFLOAD_CANCEL] = (nfsd4_enc)nfsd4_encode_noop, [OP_OFFLOAD_STATUS] = (nfsd4_enc)nfsd4_encode_noop, - [OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_noop, + [OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_read_plus, [OP_SEEK] = (nfsd4_enc)nfsd4_encode_seek, [OP_WRITE_SAME] = (nfsd4_enc)nfsd4_encode_noop, };