diff mbox

[6/6] ceph: don't acquire i_mutex ceph_vmtruncate_work

Message ID 1357299261-20591-7-git-send-email-zheng.z.yan@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yan, Zheng Jan. 4, 2013, 11:34 a.m. UTC
From: "Yan, Zheng" <zheng.z.yan@intel.com>

In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
So ceph_get_caps() can be called while i_mutex is locked. If there is
pending vmtruncate, ceph_get_caps() will wait for it to complete, but
ceph_vmtruncate_work() is blocked by the locked i_mutex.

There are several places that call __ceph_do_pending_vmtruncate()
without holding the i_mutex, I think it's OK to not acquire the i_mutex
in ceph_vmtruncate_work()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/inode.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

Comments

Sage Weil Jan. 6, 2013, 6 a.m. UTC | #1
On Fri, 4 Jan 2013, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> 
> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
> So ceph_get_caps() can be called while i_mutex is locked. If there is
> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
> ceph_vmtruncate_work() is blocked by the locked i_mutex.

Hmm... :/

> There are several places that call __ceph_do_pending_vmtruncate()
> without holding the i_mutex, I think it's OK to not acquire the i_mutex
> in ceph_vmtruncate_work()

The intention was that that function woudl only be called under i_mutex.  
I did a quick look through the callers and that appears to be the case 
(for things llseek and setattr, the vfs should be taking i_mutex).

IIRC, this is to serialize the page cache truncation with truncate 
operations; this work can only be sanely done under i_mutex, so we defer 
it to the work queue or next person who takes i_mutex and cares about the 
mapping and i_size being consistent.

What was the deadlock you observed?

sage


> 
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  fs/ceph/inode.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 81613bc..c895c7f 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1420,7 +1420,7 @@ out:
>  
>  
>  /*
> - * called by trunc_wq; take i_mutex ourselves
> + * called by trunc_wq;
>   *
>   * We also truncate in a separate thread as well.
>   */
> @@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
>  	struct inode *inode = &ci->vfs_inode;
>  
>  	dout("vmtruncate_work %p\n", inode);
> -	mutex_lock(&inode->i_mutex);
>  	__ceph_do_pending_vmtruncate(inode);
> -	mutex_unlock(&inode->i_mutex);
>  	iput(inode);
>  }
>  
> -- 
> 1.7.11.7
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yan, Zheng Jan. 6, 2013, 7:54 a.m. UTC | #2
On 01/06/2013 02:00 PM, Sage Weil wrote:
> On Fri, 4 Jan 2013, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
>> So ceph_get_caps() can be called while i_mutex is locked. If there is
>> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
>> ceph_vmtruncate_work() is blocked by the locked i_mutex.
> 
> Hmm... :/
> 
>> There are several places that call __ceph_do_pending_vmtruncate()
>> without holding the i_mutex, I think it's OK to not acquire the i_mutex
>> in ceph_vmtruncate_work()
> 
> The intention was that that function woudl only be called under i_mutex.  
> I did a quick look through the callers and that appears to be the case 
> (for things llseek and setattr, the vfs should be taking i_mutex).

both ceph_aio_read() and ceph_aio_write() call __ceph_do_pending_vmtruncate()
without holding the i_mutex

> 
> IIRC, this is to serialize the page cache truncation with truncate 
> operations; this work can only be sanely done under i_mutex, so we defer 
> it to the work queue or next person who takes i_mutex and cares about the 
> mapping and i_size being consistent.
> 
> What was the deadlock you observed?

generic_file_aio_write() locks the i_mutex, then indirectly calls ceph_get_caps()
through ceph_write_begin(). ceph_get_caps() wait for pending vmtruncate to complete,
but the work queue is blocked by the i_mutex.

Regards
Yan, Zheng

> 
> sage
> 
> 
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  fs/ceph/inode.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 81613bc..c895c7f 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -1420,7 +1420,7 @@ out:
>>  
>>  
>>  /*
>> - * called by trunc_wq; take i_mutex ourselves
>> + * called by trunc_wq;
>>   *
>>   * We also truncate in a separate thread as well.
>>   */
>> @@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
>>  	struct inode *inode = &ci->vfs_inode;
>>  
>>  	dout("vmtruncate_work %p\n", inode);
>> -	mutex_lock(&inode->i_mutex);
>>  	__ceph_do_pending_vmtruncate(inode);
>> -	mutex_unlock(&inode->i_mutex);
>>  	iput(inode);
>>  }
>>  
>> -- 
>> 1.7.11.7
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Jan. 7, 2013, 5:05 a.m. UTC | #3
On Sun, 6 Jan 2013, Yan, Zheng wrote:
> On 01/06/2013 02:00 PM, Sage Weil wrote:
> > On Fri, 4 Jan 2013, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >>
> >> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
> >> So ceph_get_caps() can be called while i_mutex is locked. If there is
> >> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
> >> ceph_vmtruncate_work() is blocked by the locked i_mutex.
> > 
> > Hmm... :/
> > 
> >> There are several places that call __ceph_do_pending_vmtruncate()
> >> without holding the i_mutex, I think it's OK to not acquire the i_mutex
> >> in ceph_vmtruncate_work()
> > 
> > The intention was that that function woudl only be called under i_mutex.  
> > I did a quick look through the callers and that appears to be the case 
> > (for things llseek and setattr, the vfs should be taking i_mutex).
> 
> both ceph_aio_read() and ceph_aio_write() call __ceph_do_pending_vmtruncate()
> without holding the i_mutex

Hrm.. that's now how I remember it working.  I'm pretty sure i_mutex was 
providing the serialization around truncation.

> > IIRC, this is to serialize the page cache truncation with truncate 
> > operations; this work can only be sanely done under i_mutex, so we defer 
> > it to the work queue or next person who takes i_mutex and cares about the 
> > mapping and i_size being consistent.
> > 
> > What was the deadlock you observed?
> 
> generic_file_aio_write() locks the i_mutex, then indirectly calls ceph_get_caps()
> through ceph_write_begin(). ceph_get_caps() wait for pending vmtruncate to complete,
> but the work queue is blocked by the i_mutex.

...but it seems clear that that's not a workable solution.  I suspect the 
right solution here is to introduce an inner i_trunc_mutex in the ceph 
inode and use that to protect truncation events in the page cache.  That 
will touch a lot of different code paths, unfortunately, but I think 
something like that is necessary if we need to block with i_mutex held by 
the VFS.

sage


> 
> Regards
> Yan, Zheng
> 
> > 
> > sage
> > 
> > 
> >>
> >> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> >> ---
> >>  fs/ceph/inode.c | 4 +---
> >>  1 file changed, 1 insertion(+), 3 deletions(-)
> >>
> >> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> >> index 81613bc..c895c7f 100644
> >> --- a/fs/ceph/inode.c
> >> +++ b/fs/ceph/inode.c
> >> @@ -1420,7 +1420,7 @@ out:
> >>  
> >>  
> >>  /*
> >> - * called by trunc_wq; take i_mutex ourselves
> >> + * called by trunc_wq;
> >>   *
> >>   * We also truncate in a separate thread as well.
> >>   */
> >> @@ -1431,9 +1431,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
> >>  	struct inode *inode = &ci->vfs_inode;
> >>  
> >>  	dout("vmtruncate_work %p\n", inode);
> >> -	mutex_lock(&inode->i_mutex);
> >>  	__ceph_do_pending_vmtruncate(inode);
> >> -	mutex_unlock(&inode->i_mutex);
> >>  	iput(inode);
> >>  }
> >>  
> >> -- 
> >> 1.7.11.7
> >>
> >>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yan, Zheng Jan. 7, 2013, 5:31 a.m. UTC | #4
On 01/07/2013 01:05 PM, Sage Weil wrote:
> On Sun, 6 Jan 2013, Yan, Zheng wrote:
>> On 01/06/2013 02:00 PM, Sage Weil wrote:
>>> On Fri, 4 Jan 2013, Yan, Zheng wrote:
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>>>
>>>> In commit 22cddde104, ceph_get_caps() was moved into ceph_write_begin().
>>>> So ceph_get_caps() can be called while i_mutex is locked. If there is
>>>> pending vmtruncate, ceph_get_caps() will wait for it to complete, but
>>>> ceph_vmtruncate_work() is blocked by the locked i_mutex.
>>>
>>> Hmm... :/
>>>
>>>> There are several places that call __ceph_do_pending_vmtruncate()
>>>> without holding the i_mutex, I think it's OK to not acquire the i_mutex
>>>> in ceph_vmtruncate_work()
>>>
>>> The intention was that that function woudl only be called under i_mutex.  
>>> I did a quick look through the callers and that appears to be the case 
>>> (for things llseek and setattr, the vfs should be taking i_mutex).
>>
>> both ceph_aio_read() and ceph_aio_write() call __ceph_do_pending_vmtruncate()
>> without holding the i_mutex
> 
> Hrm.. that's now how I remember it working.  I'm pretty sure i_mutex was 
> providing the serialization around truncation.
> 
>>> IIRC, this is to serialize the page cache truncation with truncate 
>>> operations; this work can only be sanely done under i_mutex, so we defer 
>>> it to the work queue or next person who takes i_mutex and cares about the 
>>> mapping and i_size being consistent.
>>>
>>> What was the deadlock you observed?
>>
>> generic_file_aio_write() locks the i_mutex, then indirectly calls ceph_get_caps()
>> through ceph_write_begin(). ceph_get_caps() wait for pending vmtruncate to complete,
>> but the work queue is blocked by the i_mutex.
> 
> ...but it seems clear that that's not a workable solution.  I suspect the 
> right solution here is to introduce an inner i_trunc_mutex in the ceph 
> inode and use that to protect truncation events in the page cache.  That 
> will touch a lot of different code paths, unfortunately, but I think 
> something like that is necessary if we need to block with i_mutex held by 
> the VFS.
> 

Maybe ceph_vmtruncate_work() can wait until no one is using write cap. I think it's safe to
truncate page cache in that case.

Regards
Yan, Zheng

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 81613bc..c895c7f 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1420,7 +1420,7 @@  out:
 
 
 /*
- * called by trunc_wq; take i_mutex ourselves
+ * called by trunc_wq;
  *
  * We also truncate in a separate thread as well.
  */
@@ -1431,9 +1431,7 @@  static void ceph_vmtruncate_work(struct work_struct *work)
 	struct inode *inode = &ci->vfs_inode;
 
 	dout("vmtruncate_work %p\n", inode);
-	mutex_lock(&inode->i_mutex);
 	__ceph_do_pending_vmtruncate(inode);
-	mutex_unlock(&inode->i_mutex);
 	iput(inode);
 }