Message ID | 4DE65202.2010502@panasas.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Jun 1, 2011, at 10:51 AM, Benny Halevy wrote: > On 2011-06-01 17:44, Weston Andros Adamson wrote: >> >> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote: >> >>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote: >>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test. >>>> >>>> This fixes the BUG at fs/nfs/write.c:941 introduced by >>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025. >>>> >>>> I was able to trigger this BUG reliably using pynfs in pnfs mode, >>>> by using dd(1) to write many small blocks. >>>> >>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com> >>>> --- >>>> Fix proposed by Trond. >>>> >>>> Benny- Does this make sense? >>>> >>>> fs/nfs/nfs4filelayout.c | 2 +- >>>> fs/nfs/pagelist.c | 5 ++++- >>>> include/linux/nfs_page.h | 3 ++- >>>> 3 files changed, 7 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>> index 4269088..1c3bb72 100644 >>>> --- a/fs/nfs/nfs4filelayout.c >>>> +++ b/fs/nfs/nfs4filelayout.c >>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, >>>> u64 p_stripe, r_stripe; >>>> u32 stripe_unit; >>>> >>>> - if (!pnfs_generic_pg_test(pgio, prev, req)) >>>> + if (!nfs_generic_pg_test(pgio, prev, req)) >>>> return 0; >>>> >>> >>> pnfs_generic_pg_test is the one that gets the layout. >>> >>> What you've done is revert to MDS IO >>> >>> Boaz >> >> Ah, you're right - I didn't even notice that! I usually confirm client -> DS communication with tcpdump. I was working for too long yesterday :) >> >> Patch: recalled. Discussion about a real fix: started. >> >> -dros > > I think the following should work: > > Benny > This diff fixes the problem for me (but with s/desc/pgio). -dros > git diff --stat -p -M > fs/nfs/nfs4filelayout.c | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c > index 4269088..9f1d445 100644 > --- a/fs/nfs/nfs4filelayout.c > +++ b/fs/nfs/nfs4filelayout.c > @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor > *pgio, struct nfs_page *prev, > u64 p_stripe, r_stripe; > u32 stripe_unit; > > + /* > + * FIXME: ideally we should be able to coalesce all requests > + * that are not block boundary aligned, but currently this > + * is problematic for the case of bsize < PAGE_CACHE_SIZE, > + * since nfs_flush_multi and nfs_pagein_multi assume you > + * can have only one struct nfs_page. > + */ > + if (desc->pg_bsize < PAGE_SIZE) > + return 0; > + > if (!pnfs_generic_pg_test(pgio, prev, req)) > return 0; > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 1, 2011 at 10:51 AM, Benny Halevy <bhalevy@panasas.com> wrote: > On 2011-06-01 17:44, Weston Andros Adamson wrote: >> >> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote: >> >>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote: >>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test. >>>> >>>> This fixes the BUG at fs/nfs/write.c:941 introduced by >>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025. >>>> >>>> I was able to trigger this BUG reliably using pynfs in pnfs mode, >>>> by using dd(1) to write many small blocks. >>>> >>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com> >>>> --- >>>> Fix proposed by Trond. >>>> >>>> Benny- Does this make sense? >>>> >>>> fs/nfs/nfs4filelayout.c | 2 +- >>>> fs/nfs/pagelist.c | 5 ++++- >>>> include/linux/nfs_page.h | 3 ++- >>>> 3 files changed, 7 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>> index 4269088..1c3bb72 100644 >>>> --- a/fs/nfs/nfs4filelayout.c >>>> +++ b/fs/nfs/nfs4filelayout.c >>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, >>>> u64 p_stripe, r_stripe; >>>> u32 stripe_unit; >>>> >>>> - if (!pnfs_generic_pg_test(pgio, prev, req)) >>>> + if (!nfs_generic_pg_test(pgio, prev, req)) >>>> return 0; >>>> >>> >>> pnfs_generic_pg_test is the one that gets the layout. >>> >>> What you've done is revert to MDS IO >>> >>> Boaz >> >> Ah, you're right - I didn't even notice that! I usually confirm client -> DS communication with tcpdump. I was working for too long yesterday :) >> >> Patch: recalled. Discussion about a real fix: started. >> >> -dros > > I think the following should work: > > Benny > > git diff --stat -p -M > fs/nfs/nfs4filelayout.c | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c > index 4269088..9f1d445 100644 > --- a/fs/nfs/nfs4filelayout.c > +++ b/fs/nfs/nfs4filelayout.c > @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor > *pgio, struct nfs_page *prev, > u64 p_stripe, r_stripe; > u32 stripe_unit; > > + /* > + * FIXME: ideally we should be able to coalesce all requests > + * that are not block boundary aligned, but currently this > + * is problematic for the case of bsize < PAGE_CACHE_SIZE, > + * since nfs_flush_multi and nfs_pagein_multi assume you > + * can have only one struct nfs_page. > + */ > + if (desc->pg_bsize < PAGE_SIZE) > + return 0; > + > if (!pnfs_generic_pg_test(pgio, prev, req)) > return 0; > Note this moves a test that was once part of the plain nfs code into the file layout driver. Why don't other drivers need this test? Fred -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: > I think the following should work: > > Benny > > git diff --stat -p -M > fs/nfs/nfs4filelayout.c | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c > index 4269088..9f1d445 100644 > --- a/fs/nfs/nfs4filelayout.c > +++ b/fs/nfs/nfs4filelayout.c > @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor > *pgio, struct nfs_page *prev, > u64 p_stripe, r_stripe; > u32 stripe_unit; > > + /* > + * FIXME: ideally we should be able to coalesce all requests > + * that are not block boundary aligned, but currently this > + * is problematic for the case of bsize < PAGE_CACHE_SIZE, > + * since nfs_flush_multi and nfs_pagein_multi assume you > + * can have only one struct nfs_page. > + */ > + if (desc->pg_bsize < PAGE_SIZE) > + return 0; > + > if (!pnfs_generic_pg_test(pgio, prev, req)) > return 0; So, there are several things that bother me about pnfs_generic_pg_test() too now that I'm looking more closely at it: 1. If the intention is to coalesce 'prev' and 'req', shouldn't we be asking for a layout with req_offset(prev) instead of req_offset(req)? 2. If we're only requesting a layout of length pg_count, don't we still need to test the layout length that the server actually returned before we can allow the coalescing? 3. if (!pgio->lseg), shouldn't we be returning an error of some sort? Right now we're returning 'true', and allowing the coalesce to occur. 4. Furthermore, shouldn't that test guarding the pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == NULL)' instead of looking at the values of pg_count and prev->wb_bytes?
On 2011-06-01 19:01, Fred Isaman wrote: > On Wed, Jun 1, 2011 at 10:51 AM, Benny Halevy <bhalevy@panasas.com> wrote: >> On 2011-06-01 17:44, Weston Andros Adamson wrote: >>> >>> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote: >>> >>>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote: >>>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test. >>>>> >>>>> This fixes the BUG at fs/nfs/write.c:941 introduced by >>>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025. >>>>> >>>>> I was able to trigger this BUG reliably using pynfs in pnfs mode, >>>>> by using dd(1) to write many small blocks. >>>>> >>>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com> >>>>> --- >>>>> Fix proposed by Trond. >>>>> >>>>> Benny- Does this make sense? >>>>> >>>>> fs/nfs/nfs4filelayout.c | 2 +- >>>>> fs/nfs/pagelist.c | 5 ++++- >>>>> include/linux/nfs_page.h | 3 ++- >>>>> 3 files changed, 7 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>>> index 4269088..1c3bb72 100644 >>>>> --- a/fs/nfs/nfs4filelayout.c >>>>> +++ b/fs/nfs/nfs4filelayout.c >>>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, >>>>> u64 p_stripe, r_stripe; >>>>> u32 stripe_unit; >>>>> >>>>> - if (!pnfs_generic_pg_test(pgio, prev, req)) >>>>> + if (!nfs_generic_pg_test(pgio, prev, req)) >>>>> return 0; >>>>> >>>> >>>> pnfs_generic_pg_test is the one that gets the layout. >>>> >>>> What you've done is revert to MDS IO >>>> >>>> Boaz >>> >>> Ah, you're right - I didn't even notice that! I usually confirm client -> DS communication with tcpdump. I was working for too long yesterday :) >>> >>> Patch: recalled. Discussion about a real fix: started. >>> >>> -dros >> >> I think the following should work: >> >> Benny >> >> git diff --stat -p -M >> fs/nfs/nfs4filelayout.c | 10 ++++++++++ >> 1 files changed, 10 insertions(+), 0 deletions(-) >> >> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >> index 4269088..9f1d445 100644 >> --- a/fs/nfs/nfs4filelayout.c >> +++ b/fs/nfs/nfs4filelayout.c >> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor >> *pgio, struct nfs_page *prev, >> u64 p_stripe, r_stripe; >> u32 stripe_unit; >> >> + /* >> + * FIXME: ideally we should be able to coalesce all requests >> + * that are not block boundary aligned, but currently this >> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, >> + * since nfs_flush_multi and nfs_pagein_multi assume you >> + * can have only one struct nfs_page. >> + */ >> + if (desc->pg_bsize < PAGE_SIZE) >> + return 0; >> + >> if (!pnfs_generic_pg_test(pgio, prev, req)) >> return 0; >> > > Note this moves a test that was once part of the plain nfs code into > the file layout driver. Why don't other drivers need this test? True. Note I said it would work, not that it's the right fix? :-/ This just tells us what change exposed this issue... Boaz moved this check to the nfs only path assuming that pg_bsize, which holds the MDS's wsize/rsize is irrelevant for coalescing requests for striping over pnfs. I'm still convinced why nfs_flush_multi cannot use desc->pg_lseg if it exists, but at the same time it seems like not doing the right thing for pnfs coalescing in nfs_pageio_init_write and nfs_pageio_do_add_request. For pnfs, we need to ignore wsize, meaning we first need to try to coalesce the pages and then decide if we're going the nfs_flush_multi or the nfs_flush_one way, based on the coalesced length. Benny > > Fred > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2011-06-01 21:07, Trond Myklebust wrote: > On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: >> I think the following should work: >> >> Benny >> >> git diff --stat -p -M >> fs/nfs/nfs4filelayout.c | 10 ++++++++++ >> 1 files changed, 10 insertions(+), 0 deletions(-) >> >> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >> index 4269088..9f1d445 100644 >> --- a/fs/nfs/nfs4filelayout.c >> +++ b/fs/nfs/nfs4filelayout.c >> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor >> *pgio, struct nfs_page *prev, >> u64 p_stripe, r_stripe; >> u32 stripe_unit; >> >> + /* >> + * FIXME: ideally we should be able to coalesce all requests >> + * that are not block boundary aligned, but currently this >> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, >> + * since nfs_flush_multi and nfs_pagein_multi assume you >> + * can have only one struct nfs_page. >> + */ >> + if (desc->pg_bsize < PAGE_SIZE) >> + return 0; >> + >> if (!pnfs_generic_pg_test(pgio, prev, req)) >> return 0; > > So, there are several things that bother me about pnfs_generic_pg_test() > too now that I'm looking more closely at it: > > 1. If the intention is to coalesce 'prev' and 'req', shouldn't we > be asking for a layout with req_offset(prev) instead of > req_offset(req)? > 2. If we're only requesting a layout of length pg_count, don't we > still need to test the layout length that the server actually > returned before we can allow the coalescing? > 3. if (!pgio->lseg), shouldn't we be returning an error of some > sort? Right now we're returning 'true', and allowing the > coalesce to occur. > 4. Furthermore, shouldn't that test guarding the > pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == > NULL)' instead of looking at the values of pg_count and > prev->wb_bytes? > or rather we get the layout for the first page in nfs_pageio_do_add_request when desc->pg_count == 0? Then, this lseg would be useful for nfs_flush_multi if we failed to coalesce, or we failed to get a layout altogether we go the nfs path and can reset pg_test to nfs_generic_pg_test. Otherwise I agree with your assertions above. Benny -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-06-01 at 21:56 +0300, Benny Halevy wrote: > For pnfs, we need to ignore wsize, meaning we first need to try > to coalesce the pages and then decide if we're going the nfs_flush_multi > or the nfs_flush_one way, based on the coalesced length. No! Ignoring the wsize is definitely wrong... If the stripe size is larger than the 'maxwrite' recommended attribute, then the DS is allowed to do a short write, in which case we have to resend. In any case, nfs_flush_multi and nfs_flush_one need a rewrite in order to deal properly with O_DIRECT writes, and so I'm expecting to get rid of the single nfs_page limit for the r/wsize<PAGE_SIZE case. Please don't make any large changes to this code at this time.
On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: > On 2011-06-01 21:07, Trond Myklebust wrote: > > On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: > >> I think the following should work: > >> > >> Benny > >> > >> git diff --stat -p -M > >> fs/nfs/nfs4filelayout.c | 10 ++++++++++ > >> 1 files changed, 10 insertions(+), 0 deletions(-) > >> > >> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c > >> index 4269088..9f1d445 100644 > >> --- a/fs/nfs/nfs4filelayout.c > >> +++ b/fs/nfs/nfs4filelayout.c > >> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor > >> *pgio, struct nfs_page *prev, > >> u64 p_stripe, r_stripe; > >> u32 stripe_unit; > >> > >> + /* > >> + * FIXME: ideally we should be able to coalesce all requests > >> + * that are not block boundary aligned, but currently this > >> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, > >> + * since nfs_flush_multi and nfs_pagein_multi assume you > >> + * can have only one struct nfs_page. > >> + */ > >> + if (desc->pg_bsize < PAGE_SIZE) > >> + return 0; > >> + > >> if (!pnfs_generic_pg_test(pgio, prev, req)) > >> return 0; > > > > So, there are several things that bother me about pnfs_generic_pg_test() > > too now that I'm looking more closely at it: > > > > 1. If the intention is to coalesce 'prev' and 'req', shouldn't we > > be asking for a layout with req_offset(prev) instead of > > req_offset(req)? > > 2. If we're only requesting a layout of length pg_count, don't we > > still need to test the layout length that the server actually > > returned before we can allow the coalescing? > > 3. if (!pgio->lseg), shouldn't we be returning an error of some > > sort? Right now we're returning 'true', and allowing the > > coalesce to occur. > > 4. Furthermore, shouldn't that test guarding the > > pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == > > NULL)' instead of looking at the values of pg_count and > > prev->wb_bytes? > > > > or rather we get the layout for the first page in > nfs_pageio_do_add_request when desc->pg_count == 0? I can live with a desc->pg_init() callback or rather, converting pg_test() and pg_doio() into a struct nfs_pageio_ops { int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req); bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req); int (*pg_doio)(struct nfs_pageio_descriptor *desc); }; and then replacing the two callback functions in the existing struct nfs_pageio_descriptor with a single pointer to a 'const struct nfs_pageio_ops'... > Then, this lseg would be useful for nfs_flush_multi > if we failed to coalesce, or we failed to get a layout > altogether we go the nfs path and can reset pg_test to > nfs_generic_pg_test. It would presumably also get rid of all those pnfs_update_layout() calls in read.c and write.c.
On 06/01/2011 10:17 PM, Trond Myklebust wrote: > On Wed, 2011-06-01 at 21:56 +0300, Benny Halevy wrote: > >> For pnfs, we need to ignore wsize, meaning we first need to try >> to coalesce the pages and then decide if we're going the nfs_flush_multi >> or the nfs_flush_one way, based on the coalesced length. > > No! Ignoring the wsize is definitely wrong... If the stripe size is > larger than the 'maxwrite' recommended attribute, then the DS is allowed > to do a short write, in which case we have to resend. > As far as I could understand the current code, desc->bsize which derives from wsize/rsize is negotiated with the MDS. But the wsize in question is the DS's one. So I think in pnfs it is only the layout-driver that can check this properly against the filer in question. Only if IO is to go through MDS the bsize check is relevant. BTW: The BUG_ON() Andy hit, does not look like an hard bug to fix ;-) > In any case, nfs_flush_multi and nfs_flush_one need a rewrite in order > to deal properly with O_DIRECT writes, and so I'm expecting to get rid > of the single nfs_page limit for the r/wsize<PAGE_SIZE case. > Please don't make any large changes to this code at this time. > Amen, Good riddance Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-06-01 at 22:29 +0300, Boaz Harrosh wrote: > On 06/01/2011 10:17 PM, Trond Myklebust wrote: > > On Wed, 2011-06-01 at 21:56 +0300, Benny Halevy wrote: > > > >> For pnfs, we need to ignore wsize, meaning we first need to try > >> to coalesce the pages and then decide if we're going the nfs_flush_multi > >> or the nfs_flush_one way, based on the coalesced length. > > > > No! Ignoring the wsize is definitely wrong... If the stripe size is > > larger than the 'maxwrite' recommended attribute, then the DS is allowed > > to do a short write, in which case we have to resend. > > > > As far as I could understand the current code, desc->bsize which derives > from wsize/rsize is negotiated with the MDS. But the wsize in question > is the DS's one. So I think in pnfs it is only the layout-driver that > can check this properly against the filer in question. Only if IO is > to go through MDS the bsize check is relevant. There is no distinction between the MDS's idea of the wsize and the DS's idea of the wsize. The DS has to support the same wsize as the MDS, since the client has no way of querying the wsize from the DS (GETATTR maxwrite is not on the allowed set of DS operations)... > BTW: The BUG_ON() Andy hit, does not look like an hard bug to fix ;-) > > > In any case, nfs_flush_multi and nfs_flush_one need a rewrite in order > > to deal properly with O_DIRECT writes, and so I'm expecting to get rid > > of the single nfs_page limit for the r/wsize<PAGE_SIZE case. > > Please don't make any large changes to this code at this time. > > > > Amen, Good riddance :-)
On 06/01/2011 10:38 PM, Trond Myklebust wrote: > > There is no distinction between the MDS's idea of the wsize and the DS's > idea of the wsize. The DS has to support the same wsize as the MDS, > since the client has no way of querying the wsize from the DS (GETATTR > maxwrite is not on the allowed set of DS operations)... > OK I did not know that. Good god how broken is that? Any way you are probably right. But then the check properly belongs in files-layout driver. Surly it has nothing to do with objects, and blocks Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-06-01 at 22:49 +0300, Boaz Harrosh wrote: > On 06/01/2011 10:38 PM, Trond Myklebust wrote: > > > > There is no distinction between the MDS's idea of the wsize and the DS's > > idea of the wsize. The DS has to support the same wsize as the MDS, > > since the client has no way of querying the wsize from the DS (GETATTR > > maxwrite is not on the allowed set of DS operations)... > > > > OK I did not know that. Good god how broken is that? > > Any way you are probably right. But then the check properly belongs in > files-layout driver. Surly it has nothing to do with objects, and blocks It belongs in the 'files' layout and in the generic (i.e. v2/v3/v4) code.
On 2011-06-01 22:29, Trond Myklebust wrote: > On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: >> On 2011-06-01 21:07, Trond Myklebust wrote: >>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: >>>> I think the following should work: >>>> >>>> Benny >>>> >>>> git diff --stat -p -M >>>> fs/nfs/nfs4filelayout.c | 10 ++++++++++ >>>> 1 files changed, 10 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>> index 4269088..9f1d445 100644 >>>> --- a/fs/nfs/nfs4filelayout.c >>>> +++ b/fs/nfs/nfs4filelayout.c >>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor >>>> *pgio, struct nfs_page *prev, >>>> u64 p_stripe, r_stripe; >>>> u32 stripe_unit; >>>> >>>> + /* >>>> + * FIXME: ideally we should be able to coalesce all requests >>>> + * that are not block boundary aligned, but currently this >>>> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, >>>> + * since nfs_flush_multi and nfs_pagein_multi assume you >>>> + * can have only one struct nfs_page. >>>> + */ >>>> + if (desc->pg_bsize < PAGE_SIZE) >>>> + return 0; >>>> + >>>> if (!pnfs_generic_pg_test(pgio, prev, req)) >>>> return 0; >>> >>> So, there are several things that bother me about pnfs_generic_pg_test() >>> too now that I'm looking more closely at it: >>> >>> 1. If the intention is to coalesce 'prev' and 'req', shouldn't we >>> be asking for a layout with req_offset(prev) instead of >>> req_offset(req)? >>> 2. If we're only requesting a layout of length pg_count, don't we >>> still need to test the layout length that the server actually >>> returned before we can allow the coalescing? >>> 3. if (!pgio->lseg), shouldn't we be returning an error of some >>> sort? Right now we're returning 'true', and allowing the >>> coalesce to occur. >>> 4. Furthermore, shouldn't that test guarding the >>> pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == >>> NULL)' instead of looking at the values of pg_count and >>> prev->wb_bytes? >>> >> >> or rather we get the layout for the first page in >> nfs_pageio_do_add_request when desc->pg_count == 0? > > I can live with a desc->pg_init() callback or rather, converting > pg_test() and pg_doio() into a > > struct nfs_pageio_ops { > int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req); > bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req); > int (*pg_doio)(struct nfs_pageio_descriptor *desc); > }; > > and then replacing the two callback functions in the existing struct > nfs_pageio_descriptor with a single pointer to a 'const struct > nfs_pageio_ops'... > looks like a good way to do this! >> Then, this lseg would be useful for nfs_flush_multi >> if we failed to coalesce, or we failed to get a layout >> altogether we go the nfs path and can reset pg_test to >> nfs_generic_pg_test. > > It would presumably also get rid of all those pnfs_update_layout() calls > in read.c and write.c. > Yup. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 1, 2011 at 4:09 PM, Benny Halevy <bhalevy@panasas.com> wrote: > On 2011-06-01 22:29, Trond Myklebust wrote: >> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: >>> On 2011-06-01 21:07, Trond Myklebust wrote: >>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: >>>>> I think the following should work: >>>>> >>>>> Benny >>>>> >>>>> git diff --stat -p -M >>>>> fs/nfs/nfs4filelayout.c | 10 ++++++++++ >>>>> 1 files changed, 10 insertions(+), 0 deletions(-) >>>>> >>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>>> index 4269088..9f1d445 100644 >>>>> --- a/fs/nfs/nfs4filelayout.c >>>>> +++ b/fs/nfs/nfs4filelayout.c >>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor >>>>> *pgio, struct nfs_page *prev, >>>>> u64 p_stripe, r_stripe; >>>>> u32 stripe_unit; >>>>> >>>>> + /* >>>>> + * FIXME: ideally we should be able to coalesce all requests >>>>> + * that are not block boundary aligned, but currently this >>>>> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, >>>>> + * since nfs_flush_multi and nfs_pagein_multi assume you >>>>> + * can have only one struct nfs_page. >>>>> + */ >>>>> + if (desc->pg_bsize < PAGE_SIZE) >>>>> + return 0; >>>>> + >>>>> if (!pnfs_generic_pg_test(pgio, prev, req)) >>>>> return 0; >>>> >>>> So, there are several things that bother me about pnfs_generic_pg_test() >>>> too now that I'm looking more closely at it: >>>> >>>> 1. If the intention is to coalesce 'prev' and 'req', shouldn't we >>>> be asking for a layout with req_offset(prev) instead of >>>> req_offset(req)? >>>> 2. If we're only requesting a layout of length pg_count, don't we >>>> still need to test the layout length that the server actually >>>> returned before we can allow the coalescing? >>>> 3. if (!pgio->lseg), shouldn't we be returning an error of some >>>> sort? Right now we're returning 'true', and allowing the >>>> coalesce to occur. >>>> 4. Furthermore, shouldn't that test guarding the >>>> pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == >>>> NULL)' instead of looking at the values of pg_count and >>>> prev->wb_bytes? >>>> >>> >>> or rather we get the layout for the first page in >>> nfs_pageio_do_add_request when desc->pg_count == 0? >> >> I can live with a desc->pg_init() callback or rather, converting >> pg_test() and pg_doio() into a >> >> struct nfs_pageio_ops { >> int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req); >> bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req); >> int (*pg_doio)(struct nfs_pageio_descriptor *desc); >> }; >> >> and then replacing the two callback functions in the existing struct >> nfs_pageio_descriptor with a single pointer to a 'const struct >> nfs_pageio_ops'... >> > > looks like a good way to do this! Is anyone coding this fix? -->Andy > >>> Then, this lseg would be useful for nfs_flush_multi >>> if we failed to coalesce, or we failed to get a layout >>> altogether we go the nfs path and can reset pg_test to >>> nfs_generic_pg_test. >> >> It would presumably also get rid of all those pnfs_update_layout() calls >> in read.c and write.c. >> > > Yup. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2011-06-06 12:47, William A. (Andy) Adamson wrote: > On Wed, Jun 1, 2011 at 4:09 PM, Benny Halevy <bhalevy@panasas.com> wrote: >> On 2011-06-01 22:29, Trond Myklebust wrote: >>> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: >>>> On 2011-06-01 21:07, Trond Myklebust wrote: >>>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: >>>>>> I think the following should work: >>>>>> >>>>>> Benny >>>>>> >>>>>> git diff --stat -p -M >>>>>> fs/nfs/nfs4filelayout.c | 10 ++++++++++ >>>>>> 1 files changed, 10 insertions(+), 0 deletions(-) >>>>>> >>>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>>>> index 4269088..9f1d445 100644 >>>>>> --- a/fs/nfs/nfs4filelayout.c >>>>>> +++ b/fs/nfs/nfs4filelayout.c >>>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor >>>>>> *pgio, struct nfs_page *prev, >>>>>> u64 p_stripe, r_stripe; >>>>>> u32 stripe_unit; >>>>>> >>>>>> + /* >>>>>> + * FIXME: ideally we should be able to coalesce all requests >>>>>> + * that are not block boundary aligned, but currently this >>>>>> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, >>>>>> + * since nfs_flush_multi and nfs_pagein_multi assume you >>>>>> + * can have only one struct nfs_page. >>>>>> + */ >>>>>> + if (desc->pg_bsize < PAGE_SIZE) >>>>>> + return 0; >>>>>> + >>>>>> if (!pnfs_generic_pg_test(pgio, prev, req)) >>>>>> return 0; >>>>> >>>>> So, there are several things that bother me about pnfs_generic_pg_test() >>>>> too now that I'm looking more closely at it: >>>>> >>>>> 1. If the intention is to coalesce 'prev' and 'req', shouldn't we >>>>> be asking for a layout with req_offset(prev) instead of >>>>> req_offset(req)? >>>>> 2. If we're only requesting a layout of length pg_count, don't we >>>>> still need to test the layout length that the server actually >>>>> returned before we can allow the coalescing? >>>>> 3. if (!pgio->lseg), shouldn't we be returning an error of some >>>>> sort? Right now we're returning 'true', and allowing the >>>>> coalesce to occur. >>>>> 4. Furthermore, shouldn't that test guarding the >>>>> pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == >>>>> NULL)' instead of looking at the values of pg_count and >>>>> prev->wb_bytes? >>>>> >>>> >>>> or rather we get the layout for the first page in >>>> nfs_pageio_do_add_request when desc->pg_count == 0? >>> >>> I can live with a desc->pg_init() callback or rather, converting >>> pg_test() and pg_doio() into a >>> >>> struct nfs_pageio_ops { >>> int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req); >>> bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req); >>> int (*pg_doio)(struct nfs_pageio_descriptor *desc); >>> }; >>> >>> and then replacing the two callback functions in the existing struct >>> nfs_pageio_descriptor with a single pointer to a 'const struct >>> nfs_pageio_ops'... >>> >> >> looks like a good way to do this! > > Is anyone coding this fix? > I started working on this but switched to porting forward spnfs and spnfs-block (which I've just pushed out). Benny > -->Andy > >> >>>> Then, this lseg would be useful for nfs_flush_multi >>>> if we failed to coalesce, or we failed to get a layout >>>> altogether we go the nfs path and can reset pg_test to >>>> nfs_generic_pg_test. >>> >>> It would presumably also get rid of all those pnfs_update_layout() calls >>> in read.c and write.c. >>> >> >> Yup. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I'm working on it. I'm doing a lot of surgery in that general area anyway... -----Original Message----- From: Benny Halevy [mailto:bhalevy@panasas.com] Sent: Monday, June 06, 2011 2:21 PM To: William A. (Andy) Adamson Cc: Myklebust, Trond; Adamson, Dros; Boaz Harrosh; Myklebust, Trond; linux-nfs@vger.kernel.org Subject: Re: [PATCH] NFS: filelayout should use nfs_generic_pg_test On 2011-06-06 12:47, William A. (Andy) Adamson wrote: > On Wed, Jun 1, 2011 at 4:09 PM, Benny Halevy <bhalevy@panasas.com> wrote: >> On 2011-06-01 22:29, Trond Myklebust wrote: >>> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: >>>> On 2011-06-01 21:07, Trond Myklebust wrote: >>>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: >>>>>> I think the following should work: >>>>>> >>>>>> Benny >>>>>> >>>>>> git diff --stat -p -M >>>>>> fs/nfs/nfs4filelayout.c | 10 ++++++++++ >>>>>> 1 files changed, 10 insertions(+), 0 deletions(-) >>>>>> >>>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c >>>>>> index 4269088..9f1d445 100644 >>>>>> --- a/fs/nfs/nfs4filelayout.c >>>>>> +++ b/fs/nfs/nfs4filelayout.c >>>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor >>>>>> *pgio, struct nfs_page *prev, >>>>>> u64 p_stripe, r_stripe; >>>>>> u32 stripe_unit; >>>>>> >>>>>> + /* >>>>>> + * FIXME: ideally we should be able to coalesce all requests >>>>>> + * that are not block boundary aligned, but currently this >>>>>> + * is problematic for the case of bsize < PAGE_CACHE_SIZE, >>>>>> + * since nfs_flush_multi and nfs_pagein_multi assume you >>>>>> + * can have only one struct nfs_page. >>>>>> + */ >>>>>> + if (desc->pg_bsize < PAGE_SIZE) >>>>>> + return 0; >>>>>> + >>>>>> if (!pnfs_generic_pg_test(pgio, prev, req)) >>>>>> return 0; >>>>> >>>>> So, there are several things that bother me about pnfs_generic_pg_test() >>>>> too now that I'm looking more closely at it: >>>>> >>>>> 1. If the intention is to coalesce 'prev' and 'req', shouldn't we >>>>> be asking for a layout with req_offset(prev) instead of >>>>> req_offset(req)? >>>>> 2. If we're only requesting a layout of length pg_count, don't we >>>>> still need to test the layout length that the server actually >>>>> returned before we can allow the coalescing? >>>>> 3. if (!pgio->lseg), shouldn't we be returning an error of some >>>>> sort? Right now we're returning 'true', and allowing the >>>>> coalesce to occur. >>>>> 4. Furthermore, shouldn't that test guarding the >>>>> pnfs_update_layout() call rather be an 'if (pgio->pg_lseg == >>>>> NULL)' instead of looking at the values of pg_count and >>>>> prev->wb_bytes? >>>>> >>>> >>>> or rather we get the layout for the first page in >>>> nfs_pageio_do_add_request when desc->pg_count == 0? >>> >>> I can live with a desc->pg_init() callback or rather, converting >>> pg_test() and pg_doio() into a >>> >>> struct nfs_pageio_ops { >>> int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req); >>> bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req); >>> int (*pg_doio)(struct nfs_pageio_descriptor *desc); >>> }; >>> >>> and then replacing the two callback functions in the existing struct >>> nfs_pageio_descriptor with a single pointer to a 'const struct >>> nfs_pageio_ops'... >>> >> >> looks like a good way to do this! > > Is anyone coding this fix? > I started working on this but switched to porting forward spnfs and spnfs-block (which I've just pushed out). Benny > -->Andy > >> >>>> Then, this lseg would be useful for nfs_flush_multi >>>> if we failed to coalesce, or we failed to get a layout >>>> altogether we go the nfs path and can reset pg_test to >>>> nfs_generic_pg_test. >>> >>> It would presumably also get rid of all those pnfs_update_layout() calls >>> in read.c and write.c. >>> >> >> Yup. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c index 4269088..9f1d445 100644 --- a/fs/nfs/nfs4filelayout.c +++ b/fs/nfs/nfs4filelayout.c @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, u64 p_stripe, r_stripe; u32 stripe_unit; + /* + * FIXME: ideally we should be able to coalesce all requests + * that are not block boundary aligned, but currently this + * is problematic for the case of bsize < PAGE_CACHE_SIZE, + * since nfs_flush_multi and nfs_pagein_multi assume you + * can have only one struct nfs_page. + */ + if (desc->pg_bsize < PAGE_SIZE) + return 0; + if (!pnfs_generic_pg_test(pgio, prev, req)) return 0;