Message ID | 1357299261-20591-7-git-send-email-zheng.z.yan@intel.com (mailing list archive) |
State | New, archived |
On Fri, 4 Jan 2013, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
> So ceph_get_caps() can be called while i_mutex is locked. If there is
> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
> ceph_vmtruncate_work() is blocked by the locked i_mutex.

Hmm... :/

> There are several places that call __ceph_do_pending_vmtruncate()
> without holding the i_mutex; I think it's OK to not acquire the i_mutex
> in ceph_vmtruncate_work().

The intention was that that function would only be called under i_mutex.
I did a quick look through the callers and that appears to be the case
(for things like llseek and setattr, the vfs should be taking i_mutex).

IIRC, this is to serialize the page cache truncation with truncate
operations; this work can only be sanely done under i_mutex, so we defer
it to the work queue or to the next person who takes i_mutex and cares
about the mapping and i_size being consistent.

What was the deadlock you observed?

sage

> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  fs/ceph/inode.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 81613bc..c895c7f 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1420,7 +1420,7 @@ out:
>
>
>  /*
> - * called by trunc_wq; take i_mutex ourselves
> + * called by trunc_wq;
>   *
>   * We also truncate in a separate thread as well.
>   */
> @@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
>  	struct inode *inode = &ci->vfs_inode;
>
>  	dout("vmtruncate_work %p\n", inode);
> -	mutex_lock(&inode->i_mutex);
>  	__ceph_do_pending_vmtruncate(inode);
> -	mutex_unlock(&inode->i_mutex);
>  	iput(inode);
>  }
>
> --
> 1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/06/2013 02:00 PM, Sage Weil wrote:
> On Fri, 4 Jan 2013, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
>> So ceph_get_caps() can be called while i_mutex is locked. If there is
>> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
>> ceph_vmtruncate_work() is blocked by the locked i_mutex.
>
> Hmm... :/
>
>> There are several places that call __ceph_do_pending_vmtruncate()
>> without holding the i_mutex; I think it's OK to not acquire the i_mutex
>> in ceph_vmtruncate_work().
>
> The intention was that that function would only be called under i_mutex.
> I did a quick look through the callers and that appears to be the case
> (for things like llseek and setattr, the vfs should be taking i_mutex).

Both ceph_aio_read() and ceph_aio_write() call
__ceph_do_pending_vmtruncate() without holding the i_mutex.

> IIRC, this is to serialize the page cache truncation with truncate
> operations; this work can only be sanely done under i_mutex, so we defer
> it to the work queue or to the next person who takes i_mutex and cares
> about the mapping and i_size being consistent.
>
> What was the deadlock you observed?

generic_file_aio_write() locks the i_mutex, then indirectly calls
ceph_get_caps() through ceph_write_begin(). ceph_get_caps() waits for the
pending vmtruncate to complete, but the work queue is blocked by the
i_mutex.

Regards
Yan, Zheng

> sage
>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  fs/ceph/inode.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 81613bc..c895c7f 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -1420,7 +1420,7 @@ out:
>>
>>
>>  /*
>> - * called by trunc_wq; take i_mutex ourselves
>> + * called by trunc_wq;
>>   *
>>   * We also truncate in a separate thread as well.
>>   */
>> @@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
>>  	struct inode *inode = &ci->vfs_inode;
>>
>>  	dout("vmtruncate_work %p\n", inode);
>> -	mutex_lock(&inode->i_mutex);
>>  	__ceph_do_pending_vmtruncate(inode);
>> -	mutex_unlock(&inode->i_mutex);
>>  	iput(inode);
>>  }
>>
>> --
>> 1.7.11.7
On Sun, 6 Jan 2013, Yan, Zheng wrote:
> On 01/06/2013 02:00 PM, Sage Weil wrote:
> > On Fri, 4 Jan 2013, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >>
> >> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
> >> So ceph_get_caps() can be called while i_mutex is locked. If there is
> >> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
> >> ceph_vmtruncate_work() is blocked by the locked i_mutex.
> >
> > Hmm... :/
> >
> >> There are several places that call __ceph_do_pending_vmtruncate()
> >> without holding the i_mutex; I think it's OK to not acquire the i_mutex
> >> in ceph_vmtruncate_work().
> >
> > The intention was that that function would only be called under i_mutex.
> > I did a quick look through the callers and that appears to be the case
> > (for things like llseek and setattr, the vfs should be taking i_mutex).
>
> Both ceph_aio_read() and ceph_aio_write() call
> __ceph_do_pending_vmtruncate() without holding the i_mutex.

Hrm.. that's not how I remember it working. I'm pretty sure i_mutex was
providing the serialization around truncation.

> > IIRC, this is to serialize the page cache truncation with truncate
> > operations; this work can only be sanely done under i_mutex, so we defer
> > it to the work queue or to the next person who takes i_mutex and cares
> > about the mapping and i_size being consistent.
> >
> > What was the deadlock you observed?
>
> generic_file_aio_write() locks the i_mutex, then indirectly calls
> ceph_get_caps() through ceph_write_begin(). ceph_get_caps() waits for the
> pending vmtruncate to complete, but the work queue is blocked by the
> i_mutex.

...but it seems clear that that's not a workable solution. I suspect the
right solution here is to introduce an inner i_trunc_mutex in the ceph
inode and use that to protect truncation events in the page cache. That
will touch a lot of different code paths, unfortunately, but I think
something like that is necessary if we need to block with i_mutex held by
the VFS.

sage

> Regards
> Yan, Zheng
>
> > sage
> >
> >> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> >> ---
> >>  fs/ceph/inode.c | 4 +---
> >>  1 file changed, 1 insertion(+), 3 deletions(-)
> >>
> >> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> >> index 81613bc..c895c7f 100644
> >> --- a/fs/ceph/inode.c
> >> +++ b/fs/ceph/inode.c
> >> @@ -1420,7 +1420,7 @@ out:
> >>
> >>
> >>  /*
> >> - * called by trunc_wq; take i_mutex ourselves
> >> + * called by trunc_wq;
> >>   *
> >>   * We also truncate in a separate thread as well.
> >>   */
> >> @@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
> >>  	struct inode *inode = &ci->vfs_inode;
> >>
> >>  	dout("vmtruncate_work %p\n", inode);
> >> -	mutex_lock(&inode->i_mutex);
> >>  	__ceph_do_pending_vmtruncate(inode);
> >> -	mutex_unlock(&inode->i_mutex);
> >>  	iput(inode);
> >>  }
> >>
> >> --
> >> 1.7.11.7
On 01/07/2013 01:05 PM, Sage Weil wrote:
> On Sun, 6 Jan 2013, Yan, Zheng wrote:
>> On 01/06/2013 02:00 PM, Sage Weil wrote:
>>> On Fri, 4 Jan 2013, Yan, Zheng wrote:
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>>>
>>>> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
>>>> So ceph_get_caps() can be called while i_mutex is locked. If there is
>>>> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
>>>> ceph_vmtruncate_work() is blocked by the locked i_mutex.
>>>
>>> Hmm... :/
>>>
>>>> There are several places that call __ceph_do_pending_vmtruncate()
>>>> without holding the i_mutex; I think it's OK to not acquire the i_mutex
>>>> in ceph_vmtruncate_work().
>>>
>>> The intention was that that function would only be called under i_mutex.
>>> I did a quick look through the callers and that appears to be the case
>>> (for things like llseek and setattr, the vfs should be taking i_mutex).
>>
>> Both ceph_aio_read() and ceph_aio_write() call
>> __ceph_do_pending_vmtruncate() without holding the i_mutex.
>
> Hrm.. that's not how I remember it working. I'm pretty sure i_mutex was
> providing the serialization around truncation.
>
>>> IIRC, this is to serialize the page cache truncation with truncate
>>> operations; this work can only be sanely done under i_mutex, so we defer
>>> it to the work queue or to the next person who takes i_mutex and cares
>>> about the mapping and i_size being consistent.
>>>
>>> What was the deadlock you observed?
>>
>> generic_file_aio_write() locks the i_mutex, then indirectly calls
>> ceph_get_caps() through ceph_write_begin(). ceph_get_caps() waits for the
>> pending vmtruncate to complete, but the work queue is blocked by the
>> i_mutex.
>
> ...but it seems clear that that's not a workable solution. I suspect the
> right solution here is to introduce an inner i_trunc_mutex in the ceph
> inode and use that to protect truncation events in the page cache. That
> will touch a lot of different code paths, unfortunately, but I think
> something like that is necessary if we need to block with i_mutex held by
> the VFS.

Maybe ceph_vmtruncate_work() can wait until no one is using the write cap.
I think it's safe to truncate the page cache in that case.

Regards
Yan, Zheng
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 81613bc..c895c7f 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1420,7 +1420,7 @@ out:
 
 
 /*
- * called by trunc_wq; take i_mutex ourselves
+ * called by trunc_wq;
  *
  * We also truncate in a separate thread as well.
  */
@@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
 	struct inode *inode = &ci->vfs_inode;
 
 	dout("vmtruncate_work %p\n", inode);
-	mutex_lock(&inode->i_mutex);
 	__ceph_do_pending_vmtruncate(inode);
-	mutex_unlock(&inode->i_mutex);
 	iput(inode);
 }