Message ID | 1383278208-4053-1-git-send-email-liwang@ubuntukylin.com (mailing list archive)
---|---
State | New, archived
Li,

First, sorry for the late reply on this.

Currently fscache is only supported for files that are open in
read-only mode. I originally was going to let fscache cache in the
write path as well, as long as the file was open with O_LAZY. I
abandoned that idea. When a user opens the file with O_LAZY we can
cache things locally with the assumption that the user will take care
of the synchronization in some other manner. Since there is no way of
invalidating a subset of the pages in an object cached by fscache,
there is no way we can make O_LAZY work well.

The ceph_readpage_to_fscache() call in writepage has no effect and it
should be removed. ceph_readpage_to_fscache() calls cache_valid() to
see if it should perform the page save, and since the file can't have
a CACHE cap at that point in time, it doesn't do it.

Thanks,
- Milosz

On Thu, Oct 31, 2013 at 11:56 PM, Li Wang <liwang@ubuntukylin.com> wrote:
> Currently, the pages in fscache are only updated in the writepage()
> path; add the same handling to the writepages() path.
>
> Signed-off-by: Min Chen <minchen@ubuntukylin.com>
> Signed-off-by: Li Wang <liwang@ubuntukylin.com>
> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
> ---
>  fs/ceph/addr.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> [patch body trimmed; the full diff appears at the end of this thread]
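For context, the write-path no-op Milosz describes is visible in the
write-to-cache helper itself, where every page save is gated on
cache_valid(). A rough sketch of the 3.12-era fs/ceph/cache.c logic is
below; it is simplified, and the real function's page-state checks and
error handling may differ:

/* Sketch of the write-to-cache path (simplified from 3.12-era code).
 * cache_valid() requires the CACHE cap; a file open for write does
 * not hold it, so fscache_write_page() is never reached here. */
void ceph_readpage_to_fscache(struct inode *inode, struct page *page)
{
        struct ceph_inode_info *ci = ceph_inode(inode);
        int ret;

        if (!cache_valid(ci))   /* no CACHE cap -> bail out */
                return;

        ret = fscache_write_page(ci->fscache, page, GFP_KERNEL);
        if (ret)
                fscache_uncache_page(ci->fscache, page);
}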
On Tue, 5 Nov 2013, Milosz Tanski wrote:
> Li,
>
> First, sorry for the late reply on this.
>
> [...]
>
> The ceph_readpage_to_fscache() call in writepage has no effect and it
> should be removed. ceph_readpage_to_fscache() calls cache_valid() to
> see if it should perform the page save, and since the file can't have
> a CACHE cap at that point in time, it doesn't do it.

(Hmm, dusting off my understanding of fscache and reading
fs/ceph/cache.c; watch out!)  It looks like cache_valid is

static inline int cache_valid(struct ceph_inode_info *ci)
{
        return ((ceph_caps_issued(ci) & CEPH_CAP_FILE_CACHE) &&
                (ci->i_fscache_gen == ci->i_rdcache_gen));
}

and in the FILE_EXCL case, the MDS will issue CACHE|BUFFER caps. But I
think the aux key (size+mtime) will prevent any use of the cache as
soon as the first write happens and mtime changes, right?

I think that in order to make this work, we need to fix/create a
file_version (or something similar) field in the (mds) inode_t to have
some useful value, i.e., increment it any time

 - a different client/writer comes along
 - a file is modified by the mds (e.g., truncated or recovered)

but allow it to otherwise remain the same as long as only a single
client is working with the file exclusively. This will be more precise
than the (size, mtime) check that is currently used, and would remain
valid when a single client opens the same file for exclusive
read/write multiple times but there are no other intervening changes.

Milosz, if that were in place, is there any reason not to wire up
writepage and allow the fscache to be used write-through?

sage

> Thanks,
> - Milosz
>
> [remainder of the quoted patch and thread trimmed]
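To make that proposal concrete: with such a field in place, the
fscache aux key could carry the file_version instead of mtime. The
sketch below is hypothetical -- ci->i_file_version and its plumbing
from the MDS do not exist yet -- but it shows the shape of the change
in the get_aux callback that builds the aux key:

/* Hypothetical aux key: (file_version, size) instead of (mtime, size).
 * i_file_version is an assumed field mirroring the proposed mds
 * inode_t file_version; the rest follows the fs/ceph/cache.c shape. */
struct ceph_aux_inode {
        u64     version;        /* assumed: mds file_version */
        u64     size;
};

static uint16_t ceph_fscache_inode_get_aux(const void *cookie_netfs_data,
                                           void *buffer, uint16_t bufmax)
{
        const struct ceph_inode_info *ci = cookie_netfs_data;
        const struct inode *inode = &ci->vfs_inode;
        struct ceph_aux_inode aux;

        memset(&aux, 0, sizeof(aux));
        aux.version = ci->i_file_version;       /* assumed field */
        aux.size = inode->i_size;

        memcpy(buffer, &aux, sizeof(aux));
        return sizeof(aux);
}

With a key like this, a lone writer holding exclusive caps would no
longer invalidate its own cache on every mtime bump; the cookie only
goes stale when another client or the mds actually changes the file.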
Sage,

I think the incrementing version counter is, on the whole, a neater
solution than using size and mtime. If nothing else, it's more
explicit about the read-cache version. With what you suggested, plus
additional changes to the open code (where the cookie gets created),
the write-through scenario should be correct.

Sadly, my understanding of the MDS protocol is still not great, so
when doing this in the first place I erred on the side of using what
was already in place.

On a somewhat unrelated question: is there a debug hook in the kclient
(or MDS, for that matter) to dump the current file inodes (names) with
issued caps and to which hosts? This would be very helpful for
debugging, since from time to time I see one of the clients get stuck
in getattr (via the mdsc debug log).

Thanks,
- Milosz

On Tue, Nov 5, 2013 at 6:56 PM, Sage Weil <sage@inktank.com> wrote:
> On Tue, 5 Nov 2013, Milosz Tanski wrote:
>> [...]
>
> (Hmm, dusting off my understanding of fscache and reading
> fs/ceph/cache.c; watch out!)  It looks like cache_valid is
>
> [...]
>
> Milosz, if that were in place, is there any reason not to wire up
> writepage and allow the fscache to be used write-through?
>
> sage
>
> [remainder of the quoted thread trimmed]
On Wed, Nov 6, 2013 at 9:41 PM, Milosz Tanski <milosz@adfin.com> wrote:
> Sage,
>
> [...]
>
> On a somewhat unrelated question: is there a debug hook in the kclient
> (or MDS, for that matter) to dump the current file inodes (names) with
> issued caps and to which hosts? This would be very helpful for
> debugging, since from time to time I see one of the clients get stuck
> in getattr (via the mdsc debug log).

"ceph mds tell \* dumpcache" dumps the mds cache to a file; the dump
file contains the caps information.

Regards
Yan, Zheng

> Thanks,
> - Milosz
>
> [remainder of the quoted thread trimmed]
On Wed, Nov 6, 2013 at 10:38 AM, Sage Weil <sage@inktank.com> wrote:
> On Wed, 6 Nov 2013, Milosz Tanski wrote:
>> Sage,
>>
>> I think the incrementing version counter is, on the whole, a neater
>> solution than using size and mtime. [...]
>
> Great to hear. I don't think it will be much work on the MDS; it'll
> just take a bit of care to catch the relevant cases.

Just let me know any way I can help that along.

>> Sadly, my understanding of the MDS protocol is still not great, so
>> when doing this in the first place I erred on the side of using what
>> was already in place.
>
> Of course!
>
>> On a somewhat unrelated question: is there a debug hook in the kclient
>> (or MDS, for that matter) to dump the current file inodes (names) with
>> issued caps and to which hosts? [...]
>
> cat /sys/kernel/debug/ceph/*/mdsc
>
> will show requests in flight, but nothing right now shows the caps.
> That would be super valuable, though. I'm not sure if there is a
> convenient way to iterate over the kmem_cache.. :/

That's how I can confirm that I am experiencing the issue. I cat mdsc
and I end up seeing one (or more) getattr requests that sit there
stubbornly. If I figure out which file it is on the filesystem and try
to access its metadata, those requests hang in a similar fashion (no
matter which node it is).

I've been trying to figure out the root cause of it for a while now,
and so far I can't tell whether it's one of the applications
misbehaving, the MDS, or the kclient. To make matters worse, it can
take weeks to encounter it again. The dumpcache feature that Yan
pointed me at should give me a new tool for debugging it.

Being able to see the same kind of view from the client side would be
super helpful in debugging. I'll take a look at the code to see if we
can iterate through the inodes for a mount point in a straightforward
manner to make it happen.

> s
>
> [remainder of the quoted thread trimmed]
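One plausible starting point for that client-side view is a new
debugfs file that walks the mount's inode list rather than the
kmem_cache. The sketch below is only that -- a sketch: the helper
name, the debugfs wiring, and taking the global inode_sb_list_lock
from the ceph module are all assumptions, and a real version might
walk the mds session caps lists instead:

/* Hypothetical debugfs "caps" dump for one ceph mount: print each
 * cached inode and the caps currently issued to this client. */
static void ceph_dump_issued_caps(struct super_block *sb,
                                  struct seq_file *s)
{
        struct inode *inode;

        spin_lock(&inode_sb_list_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                struct ceph_inode_info *ci = ceph_inode(inode);

                /* ceph_vinop() expands to (ino, snap) for %llx.%llx */
                seq_printf(s, "%llx.%llx\t%s\n", ceph_vinop(inode),
                           ceph_cap_string(ceph_caps_issued(ci)));
        }
        spin_unlock(&inode_sb_list_lock);
}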
Yan and Sage,

I've run into this issue again on my test cluster. The client hangs
all requests for a particular inode. I did a dump cache to see what's
going on, but I don't understand it well enough to be able to read
this line.

Can you guys help me read this, so I can further track down and
hopefully fix this issue?

[inode 10000346eed [2,head]
/petabucket/beta/17511b3d12466609785b6a0e34597431721d177240371c0a1a4e347a1605381b/advertiser_id.dict
auth v214 ap=5+0 dirtyparent s=925 n(v0 b925 1=1+0) (ifile sync->mix)
(iversion lock) cr={59947=0-4194304@1}
caps={59947=pAsLsXsFr/pAsxXsxFxwb@26,60001=pAsLsXsFr/-@1,60655=pAsLsXsFr/pAsLsXsFscr/pFscr@36}
| ptrwaiter=0 request=4 lock=1 caps=1 dirtyparent=1 dirty=1 waiter=1
authpin=1 0x17dd6b70]

root@bnode-16a1ed7d:~# cat
/sys/kernel/debug/ceph/e23a1bfc-8328-46bf-bc59-1209df3f5434.client60655/mdsc
15659 mds0 getattr #10000346eed
15679 mds0 getattr #10000346eed
15710 mds0 getattr #10000346eed
15922 mds0 getattr #10000346eed

On Wed, Nov 6, 2013 at 10:01 AM, Yan, Zheng <ukernel@gmail.com> wrote:
> On Wed, Nov 6, 2013 at 9:41 PM, Milosz Tanski <milosz@adfin.com> wrote:
>> [...]
>>
>> On a somewhat unrelated question: is there a debug hook in the kclient
>> (or MDS, for that matter) to dump the current file inodes (names) with
>> issued caps and to which hosts?
>
> "ceph mds tell \* dumpcache" dumps the mds cache to a file; the dump
> file contains the caps information.
>
> Regards
> Yan, Zheng
>
> [remainder of the quoted thread trimmed]
On Wed, Nov 20, 2013 at 12:05 AM, Milosz Tanski <milosz@adfin.com> wrote:
> Yan and Sage,
>
> I've run into this issue again on my test cluster. The client hangs
> all requests for a particular inode. I did a dump cache to see what's
> going on, but I don't understand it well enough to be able to read
> this line.
>
> [inode 10000346eed [2,head]
> /petabucket/beta/17511b3d12466609785b6a0e34597431721d177240371c0a1a4e347a1605381b/advertiser_id.dict
> auth v214 ap=5+0 dirtyparent s=925 n(v0 b925 1=1+0) (ifile sync->mix)
> (iversion lock) cr={59947=0-4194304@1}
> caps={59947=pAsLsXsFr/pAsxXsxFxwb@26,60001=pAsLsXsFr/-@1,60655=pAsLsXsFr/pAsLsXsFscr/pFscr@36}
> | ptrwaiter=0 request=4 lock=1 caps=1 dirtyparent=1 dirty=1 waiter=1
> authpin=1 0x17dd6b70]
>
> root@bnode-16a1ed7d:~# cat
> /sys/kernel/debug/ceph/e23a1bfc-8328-46bf-bc59-1209df3f5434.client60655/mdsc
> 15659 mds0 getattr #10000346eed
> 15679 mds0 getattr #10000346eed
> 15710 mds0 getattr #10000346eed
> 15922 mds0 getattr #10000346eed

Which kernel do you use? Is there any blocked process (echo w >
/proc/sysrq-trigger) on client.60655? The 3.12 kernel contains a few
fixes for similar hangs.

Regards
Yan, Zheng

> [remainder of the quoted thread trimmed]
Yan,

I'll use this trick next time around. I did dump the kernel stacks for
my process; 4 threads were blocked in SYS_newfstat (with the mds
request further up the stack). I ended up restarting the MDS after a
few hours of trying to track it down, and the hang resolved itself
following that.

This machine is running a pretty recent kernel -- 3.12 plus the merged
ceph testing branch -- and the other machines are running a slightly
older 3.12-rc. I have observed the issue previously on random nodes,
infrequently, for months (maybe once every week or two).

Thanks again,
- Milosz

On Tue, Nov 19, 2013 at 8:18 PM, Yan, Zheng <ukernel@gmail.com> wrote:
> On Wed, Nov 20, 2013 at 12:05 AM, Milosz Tanski <milosz@adfin.com> wrote:
>> [...]
>
> Which kernel do you use? Is there any blocked process (echo w >
> /proc/sysrq-trigger) on client.60655? The 3.12 kernel contains a few
> fixes for similar hangs.
>
> Regards
> Yan, Zheng
>
> [remainder of the quoted thread trimmed]
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6df8bd4..cc57911 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -746,7 +746,7 @@ retry:
 
 	while (!done && index <= end) {
 		int num_ops = do_sync ? 2 : 1;
-		unsigned i;
+		unsigned i, j;
 		int first;
 		pgoff_t next;
 		int pvec_pages, locked_pages;
@@ -894,7 +894,6 @@ get_more_pages:
 		if (!locked_pages)
 			goto release_pvec_pages;
 		if (i) {
-			int j;
 			BUG_ON(!locked_pages || first < 0);
 
 			if (pvec_pages && i == pvec_pages &&
@@ -924,7 +923,10 @@ get_more_pages:
 
 		osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0,
 						!!pool, false);
-
+		for (j = 0; j < locked_pages; j++) {
+			struct page *page = pages[j];
+			ceph_readpage_to_fscache(inode, page);
+		}
 		pages = NULL;	/* request message now owns the pages array */
 		pool = NULL;
 