Message ID: 20140319151252.16ed3ac6@gandalf.local.home (mailing list archive)
State: New, archived
On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> My question to Tejun is, if we create another workqueue, to add the
> rdata->work to, would that prevent the above problem? Or what other
> fixes can we do?

The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.

WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.

Creating another workqueue doesn't actually create more threads.

There is the kthread_work stuff if you want a guaranteed worker thread.
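[For reference, a minimal sketch of the kthread_work mechanism Peter
mentions, using the ~3.14-era API names (these helpers were later
renamed kthread_init_worker() etc.). The worker, task, and function
names below are hypothetical, not taken from any cifs patch. A
kthread_worker is backed by exactly one dedicated kernel thread, so
work queued to it always has its own execution context, independent of
the shared kworker pools:

	#include <linux/kthread.h>
	#include <linux/err.h>

	static struct kthread_worker read_worker;
	static struct task_struct *read_worker_task;
	static struct kthread_work demo_work;

	/* always executes in read_worker_task, never in a shared kworker */
	static void read_complete_fn(struct kthread_work *work)
	{
		/* e.g. complete the read, unlock the page, wake the reader */
	}

	static int __init start_read_worker(void)
	{
		init_kthread_worker(&read_worker);
		read_worker_task = kthread_run(kthread_worker_fn, &read_worker,
					       "cifsiord");
		if (IS_ERR(read_worker_task))
			return PTR_ERR(read_worker_task);

		init_kthread_work(&demo_work, read_complete_fn);
		queue_kthread_work(&read_worker, &demo_work);
		return 0;
	}
]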
On Wed, 19 Mar 2014 20:34:07 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> > My question to Tejun is, if we create another workqueue, to add the
> > rdata->work to, would that prevent the above problem? Or what other
> > fixes can we do?
>
> The way I understand workqueues is that we cannot guarantee concurrency
> like this. It tries, but there's no guarantee.
>
> WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> given 511 other blocked works, the described problem will always happen.
>
> Creating another workqueue doesn't actually create more threads.

But I noticed this:

Before patch:

# ps aux |grep cifs
root      3119  0.0  0.0      0     0 ?        S<   14:17   0:00 [cifsiod]

After patch:

# ps aux |grep cifs
root      1109  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiod]
root      1111  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiord]

It looks to me that it does create new threads.

-- Steve

> There is the kthread_work stuff if you want a guaranteed worker
> thread.
On Wed, 19 Mar 2014 15:43:39 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 19 Mar 2014 20:34:07 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> > > My question to Tejun is, if we create another workqueue, to add the
> > > rdata->work to, would that prevent the above problem? Or what other
> > > fixes can we do?
> >
> > The way I understand workqueues is that we cannot guarantee concurrency
> > like this. It tries, but there's no guarantee.
> >
> > WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> > given 511 other blocked works, the described problem will always happen.
> >
> > Creating another workqueue doesn't actually create more threads.
>
> But I noticed this:
>
> Before patch:
>
> # ps aux |grep cifs
> root      3119  0.0  0.0      0     0 ?        S<   14:17   0:00 [cifsiod]
>
> After patch:
>
> # ps aux |grep cifs
> root      1109  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiod]
> root      1111  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiord]
>
> It looks to me that it does create new threads.

Or is that just the rescuer thread?

I can rewrite the patch to use kthread_work instead too.

-- Steve
On Wed, Mar 19, 2014 at 03:43:39PM -0400, Steven Rostedt wrote:
> On Wed, 19 Mar 2014 20:34:07 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Mar 19, 2014 at 03:12:52PM -0400, Steven Rostedt wrote:
> > > My question to Tejun is, if we create another workqueue, to add the
> > > rdata->work to, would that prevent the above problem? Or what other
> > > fixes can we do?
> >
> > The way I understand workqueues is that we cannot guarantee concurrency
> > like this. It tries, but there's no guarantee.
> >
> > WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> > given 511 other blocked works, the described problem will always happen.
> >
> > Creating another workqueue doesn't actually create more threads.
>
> But I noticed this:
>
> Before patch:
>
> # ps aux |grep cifs
> root      3119  0.0  0.0      0     0 ?        S<   14:17   0:00 [cifsiod]
>
> After patch:
>
> # ps aux |grep cifs
> root      1109  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiod]
> root      1111  0.0  0.0      0     0 ?        S<   15:11   0:00 [cifsiord]
>
> It looks to me that it does create new threads.

Ah, I think that's because of the MEM_RECLAIM; not sure if that will
normally participate in running works. It's been a long time since I
looked at any of that.

Let's wait for TJ to wake up.
Hello, Steven, Peter.

On Wed, Mar 19, 2014 at 08:34:07PM +0100, Peter Zijlstra wrote:
> The way I understand workqueues is that we cannot guarantee concurrency
> like this. It tries, but there's no guarantee.

So, the guarantee is that if a workqueue has WQ_MEM_RECLAIM, it'll
always have at least one worker thread working on it, so workqueues
which may be depended upon during memory reclaim should have the flag
set and must not require more than a single level of concurrency to
make forward progress. Workqueues w/o memory reclaim set depend on the
fact that eventually memory will be reclaimed and enough workers to
make forward progress will be made available.

> WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
> given 511 other blocked works, the described problem will always happen.

That actually is a per-workqueue limit, and the workqueue core will try
to create as many workers as necessary to satisfy the demanded
concurrency. ie. having two workqueues with the same max_active means
that the total number of workers may reach 2 * max_active; however,
this is no guarantee. If the system is under memory pressure and the
workqueues don't have MEM_RECLAIM set, they may not get any concurrency
until more memory is made available.

> Creating another workqueue doesn't actually create more threads.

It looks like the issue Steven is describing is caused by having a
dependency chain longer than 1 through rwsem in a MEM_RECLAIM
workqueue. Moving the write work items to a separate workqueue breaks
the r-w-r chain and ensures that forward progress can be made with a
single level of concurrency on both workqueues, so, yeah, it looks like
the correct fix to me. It is scarily subtle though, and quite likely
present in other code paths too. :(

Thanks.
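[An illustrative sketch of the guarantee Tejun describes; the queue and
function names here are assumptions for illustration, not the actual
cifs patch. Each WQ_MEM_RECLAIM workqueue keeps a rescuer thread, so
each queue is guaranteed at least one level of execution even when no
new kworkers can be forked under memory pressure. Splitting the
lock_sem-dependent work and the work that releases lock_sem waiters
across two such queues is what breaks the chain:

	#include <linux/workqueue.h>

	static struct workqueue_struct *io_wq;   /* work that may take lock_sem */
	static struct workqueue_struct *read_wq; /* read completions only */

	static int __init init_wqs(void)
	{
		/* each MEM_RECLAIM queue gets its own rescuer thread */
		io_wq = alloc_workqueue("cifsiod",
					WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
		if (!io_wq)
			return -ENOMEM;

		read_wq = alloc_workqueue("cifsiord",
					  WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
		if (!read_wq) {
			destroy_workqueue(io_wq);
			return -ENOMEM;
		}
		return 0;
	}

	/*
	 * The split: work that may sleep in down_read()/down_write() on
	 * lock_sem goes to io_wq, while the completion that wakes the
	 * lock_sem holder gets read_wq's guaranteed worker -- the two
	 * can no longer queue behind each other on one rescuer.
	 */
	static void queue_split(struct work_struct *oplock_break,
				struct work_struct *read_complete)
	{
		queue_work(io_wq, oplock_break);
		queue_work(read_wq, read_complete);
	}
]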
On Wed, 19 Mar 2014 15:12:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> We just had a customer report a deadlock in their 3.8-rt kernel.
> Looking into this, it is very possible to have the same deadlock in
> mainline in Linus's tree as it stands today.
>
> It is much easier to deadlock in the -rt kernel because reader locks
> are serialized, where a down_read() can block another down_read(). But
> because rwsems are fair locks, if a writer is waiting, a new reader
> will then block. This means that if it is possible for a reader to
> deadlock another reader, this can happen if a writer comes along and
> blocks on a current reader. That will prevent another reader from
> running, and if that new reader is needed to wake up the reader that
> owns the lock, you have your deadlock.
>
> Here's the situation with CIFS and workqueues:
>
> The cifs system has several workqueues used in file.c and other
> places. One of them is used for completion of a read and to release
> the page_lock which wakes up the reader. There are several other
> workqueues that do various other tasks.
>
> A holder of the reader lock can sleep on a page_lock() and expect the
> reader workqueue to wake it up (page_unlock()). The reader workqueue
> takes no locks so this does not seem to be a problem (but it is).
>
> The other workqueues can take the rwsem for read or for write. But the
> issue we tripped over was that it grabs it for read (remember in
> -rt readers are serialized). But this can also happen if a separate
> writer is waiting on the lock as that would cause a reader to block on
> another reader too.
>
> All the workqueue callbacks are executed on the same workqueue:
>
>	queue_work(cifsiod_wq, &rdata->work);
>	[...]
>	queue_work(cifsiod_wq, &cfile->oplock_break);
>
> Now if the reader workqueue callback is queued after one of these
> workqueues that can take the rwsem, we can hit a deadlock. The
> workqueue code looks to be able to prevent deadlocks of these kinds,
> but I do not totally understand the workqueue scheduled work structure
> and perhaps if the kworker thread structure blocks hard it won't move
> works around.
>
> Here's what we see:
>
> rdata->work is scheduled after cfile->oplock_break
>
>	CPU0				CPU1
>	----				----
>
>  do_sync_read()
>    cifs_strict_readv()
>      down_read(cinode->lock_sem);
>      generic_file_aio_read()
>        __lock_page_killable()
>          __wait_on_bit_lock()
>
>     * BLOCKED *
>
>				process_one_work()
>				  cifs_oplock_break()
>				    cifs_has_mand_locks()
>				      down_read(cinode->lock_sem);
>
>				   * BLOCKED *
>
>	[ note, cifs_oplock_break() can also call cifs_push_locks which
>	  takes the lock with down_write() ]
>
> But we noticed that the rdata->work was queued to run under the same
> workqueue task and this work is to wake up the owner of the semaphore.
> But because the workqueue task is blocked waiting on that lock, it
> will never wake it up.
>
> My question to Tejun is, if we create another workqueue, to add the
> rdata->work to, would that prevent the above problem? Or what other
> fixes can we do?
>
> This is only compile tested; we have not given it to our customer
> yet.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Cc'ing Pavel since he wrote most of the lock pushing code...

Nice analysis! I think eventually we'll need to overhaul this code not
to use rw semaphores, but that's going to take some redesign. (Wonder
if we could change it to use seqlocks or something?)

Out of curiosity, does this eventually time out and unwedge itself?
Usually when the server doesn't get a response to an oplock break in
around a minute or so it gives up and allows the thing that caused the
oplock break to proceed anyway. Not great for performance, but it ought
to eventually make progress due to that.

In any case, this looks like a reasonable fix for now, but I suspect you
can hit similar problems in the write codepath too. What may be best is
to turn this around and queue the oplock break to the new workqueue
instead of the read completion job.

> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 849f613..6656058 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -86,6 +86,7 @@ extern mempool_t *cifs_req_poolp;
>  extern mempool_t *cifs_mid_poolp;
>  
>  struct workqueue_struct *cifsiod_wq;
> +struct workqueue_struct *cifsiord_wq;
>  
>  #ifdef CONFIG_CIFS_SMB2
>  __u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
> @@ -1199,9 +1200,15 @@ init_cifs(void)
>  		goto out_clean_proc;
>  	}
>  
> +	cifsiord_wq = alloc_workqueue("cifsiord",
> +				      WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
> +	if (!cifsiord_wq) {
> +		rc = -ENOMEM;
> +		goto out_destroy_wq;
> +	}
> +
>  	rc = cifs_fscache_register();
>  	if (rc)
> -		goto out_destroy_wq;
> +		goto out_destroy_rwq;
>  
>  	rc = cifs_init_inodecache();
>  	if (rc)
> @@ -1249,6 +1256,8 @@ out_destroy_inodecache:
>  	cifs_destroy_inodecache();
>  out_unreg_fscache:
>  	cifs_fscache_unregister();
> +out_destroy_rwq:
> +	destroy_workqueue(cifsiord_wq);
>  out_destroy_wq:
>  	destroy_workqueue(cifsiod_wq);
>  out_clean_proc:
> @@ -1273,6 +1282,7 @@ exit_cifs(void)
>  	cifs_destroy_inodecache();
>  	cifs_fscache_unregister();
>  	destroy_workqueue(cifsiod_wq);
> +	destroy_workqueue(cifsiord_wq);
>  	cifs_proc_clean();
>  }
>  
> diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
> index c0f3718..75d1941 100644
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct *work);
>  extern const struct slow_work_ops cifs_oplock_break_ops;
>  
>  extern struct workqueue_struct *cifsiod_wq;
> +extern struct workqueue_struct *cifsiord_wq;
>  
>  extern mempool_t *cifs_mid_poolp;
>  
> diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
> index f3264bd..ca04a2e 100644
> --- a/fs/cifs/cifssmb.c
> +++ b/fs/cifs/cifssmb.c
> @@ -1571,7 +1571,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
>  		rdata->result = -EIO;
>  	}
>  
> -	queue_work(cifsiod_wq, &rdata->work);
> +	queue_work(cifsiord_wq, &rdata->work);
>  	DeleteMidQEntry(mid);
>  	add_credits(server, 1, 0);
>  }
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index 8603447..b74bf61 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -1742,7 +1742,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
>  	if (rdata->result)
>  		cifs_stats_fail_inc(tcon, SMB2_READ_HE);
>  
> -	queue_work(cifsiod_wq, &rdata->work);
> +	queue_work(cifsiord_wq, &rdata->work);
>  	DeleteMidQEntry(mid);
>  	add_credits(server, credits_received, 0);
>  }
On Thu, 20 Mar 2014 15:28:33 -0400
Jeffrey Layton <jlayton@redhat.com> wrote:

> Nice analysis! I think eventually we'll need to overhaul this code not

Note, Ulrich Obergfell helped a bit in the initial analysis. He found
from a customer core dump that the kworker thread was blocked on the
cinode->lock_sem, and the reader was blocked as well. That was enough
for me to find where the problem lay.

> to use rw semaphores, but that's going to take some redesign. (Wonder
> if we could change it to use seqlocks or something?)
>
> Out of curiosity, does this eventually time out and unwedge itself?
> Usually when the server doesn't get a response to an oplock break in
> around a minute or so it gives up and allows the thing that caused the
> oplock break to proceed anyway. Not great for performance, but it ought
> to eventually make progress due to that.

No, I believe it's hard locked. Nothing is going to wake up the oplock
break if it is blocked on a down_read(). Only the release of the rwsem
will do that. It's the subtle way the kworker threads are done.

> In any case, this looks like a reasonable fix for now, but I suspect you
> can hit similar problems in the write codepath too. What may be best is
> to turn this around and queue the oplock break to the new workqueue
> instead of the read completion job.

Or perhaps give both the read and write their own workqueues? We have
to look at all the work queue handlers, and be careful about any users
that take the lock_sem, and separate them out.

-- Steve
On Thu, 20 Mar 2014 16:57:03 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 20 Mar 2014 15:28:33 -0400
> Jeffrey Layton <jlayton@redhat.com> wrote:
>
> > Nice analysis! I think eventually we'll need to overhaul this code not
>
> Note, Ulrich Obergfell helped a bit in the initial analysis. He found
> from a customer core dump that the kworker thread was blocked on the
> cinode->lock_sem, and the reader was blocked as well. That was enough
> for me to find where the problem lay.

Kudos to Uli, then ;)

> > to use rw semaphores, but that's going to take some redesign. (Wonder
> > if we could change it to use seqlocks or something?)
> >
> > Out of curiosity, does this eventually time out and unwedge itself?
> > Usually when the server doesn't get a response to an oplock break in
> > around a minute or so it gives up and allows the thing that caused the
> > oplock break to proceed anyway. Not great for performance, but it ought
> > to eventually make progress due to that.
>
> No, I believe it's hard locked. Nothing is going to wake up the oplock
> break if it is blocked on a down_read(). Only the release of the rwsem
> will do that. It's the subtle way the kworker threads are done.

Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).

If that's not happening then I'd be curious as to why...

> > In any case, this looks like a reasonable fix for now, but I suspect you
> > can hit similar problems in the write codepath too. What may be best is
> > to turn this around and queue the oplock break to the new workqueue
> > instead of the read completion job.
>
> Or perhaps give both the read and write their own workqueues? We have
> to look at all the work queue handlers, and be careful about any users
> that take the lock_sem, and separate them out.

Yeah, I haven't looked closely yet, but I'm fairly sure that you could
hit the same situation in the write codepath as well. Whether adding
more workqueues will really help, I'm not sure of yet...
On Wed, 19 Mar 2014 15:12:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> We just had a customer report a deadlock in their 3.8-rt kernel.
> Looking into this, it is very possible to have the same deadlock in
> mainline in Linus's tree as it stands today.
>
> It is much easier to deadlock in the -rt kernel because reader locks
> are serialized, where a down_read() can block another down_read(). But
> because rwsems are fair locks, if a writer is waiting, a new reader
> will then block. This means that if it is possible for a reader to
> deadlock another reader, this can happen if a writer comes along and
> blocks on a current reader. That will prevent another reader from
> running, and if that new reader is needed to wake up the reader that
> owns the lock, you have your deadlock.
>
> Here's the situation with CIFS and workqueues:
>
> The cifs system has several workqueues used in file.c and other places.
> One of them is used for completion of a read and to release the
> page_lock which wakes up the reader. There are several other workqueues
> that do various other tasks.
>
> A holder of the reader lock can sleep on a page_lock() and expect the
> reader workqueue to wake it up (page_unlock()). The reader workqueue
> takes no locks so this does not seem to be a problem (but it is).
>
> The other workqueues can take the rwsem for read or for write. But the
> issue we tripped over was that it grabs it for read (remember in
> -rt readers are serialized). But this can also happen if a separate
> writer is waiting on the lock as that would cause a reader to block on
> another reader too.
>
> All the workqueue callbacks are executed on the same workqueue:
>
>	queue_work(cifsiod_wq, &rdata->work);
>	[...]
>	queue_work(cifsiod_wq, &cfile->oplock_break);
>
> Now if the reader workqueue callback is queued after one of these
> workqueues that can take the rwsem, we can hit a deadlock. The
> workqueue code looks to be able to prevent deadlocks of these kinds,
> but I do not totally understand the workqueue scheduled work structure
> and perhaps if the kworker thread structure blocks hard it won't move
> works around.
>
> Here's what we see:
>
> rdata->work is scheduled after cfile->oplock_break
>
>	CPU0				CPU1
>	----				----
>
>  do_sync_read()
>    cifs_strict_readv()
>      down_read(cinode->lock_sem);
>      generic_file_aio_read()
>        __lock_page_killable()
>          __wait_on_bit_lock()
>
>     * BLOCKED *
>
>				process_one_work()
>				  cifs_oplock_break()
>				    cifs_has_mand_locks()
>				      down_read(cinode->lock_sem);
>
>				   * BLOCKED *
>
>	[ note, cifs_oplock_break() can also call cifs_push_locks which
>	  takes the lock with down_write() ]

Wait...why does the work running on CPU1 end up blocked on down_read()?
Is it really getting stuck on the down_write you mention?
On Thu, 20 Mar 2014 19:53:46 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> Wait...why does the work running on CPU1 end up blocked on down_read()?
> Is it really getting stuck on the down_write you mention?

rwsems are fair locks. Readers will not block on a reader lock unless
there's a writer waiting. That's the key. As soon as a writer blocks on
a lock that is held by a reader (or multiple readers), new readers
coming in will also block to let the writer get a chance. Otherwise, it
is an unfair lock and the readers can starve the writer.

But people tend to forget that a waiting writer causes readers to block
on each other, and if the reader locks can deadlock each other, they
will deadlock with a writer waiting.

-- Steve
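[A minimal self-contained sketch of the fairness behavior Steve
describes; the three threads and names are hypothetical, not cifs code.
Thread A holds the lock for read but cannot release it until thread C
runs; the waiting writer B forces the fair rwsem to queue C behind it,
so C never runs and all three are stuck:

	#include <linux/rwsem.h>
	#include <linux/completion.h>

	static DECLARE_RWSEM(sem);
	static DECLARE_COMPLETION(work_done);

	/* Thread A: reader that needs thread C's work before releasing */
	static void thread_a(void)
	{
		down_read(&sem);                 /* granted immediately */
		wait_for_completion(&work_done); /* sleeps until C runs */
		up_read(&sem);
	}

	/* Thread B: writer arriving while A still holds the lock */
	static void thread_b(void)
	{
		down_write(&sem);                /* blocks behind A ... */
		up_write(&sem);
	}

	/* Thread C: second reader, arriving after B */
	static void thread_c(void)
	{
		down_read(&sem);      /* blocks: a writer is waiting, so
					 the fair rwsem queues new
					 readers behind it */
		complete(&work_done); /* never reached -> deadlock */
		up_read(&sem);
	}
]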
On Thu, 20 Mar 2014 17:02:39 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> Eventually the server should just allow the read to complete even if
> the client doesn't respond to the oplock break. It has to since clients
> can suddenly drop off the net while holding an oplock. That should
> allow everything to unwedge eventually (though it may take a while).
>
> If that's not happening then I'd be curious as to why...

The problem is that the data is being filled in the page and the reader
is waiting for the page lock to be released. The kworker for the reader
will issue the complete() and unlock the page to wake up the reader.

But because the other workqueue callback calls down_read(), and there
can be a down_write() waiting for the reader to finish, this
down_read() will block on the lock as well (rwsems are fair locks).
This blocks the other workqueue callback from issuing the complete and
page_unlock() that will wake up the reader that is holding the rwsem
with down_read().

DEADLOCK.

-- Steve
2014-03-21 6:23 GMT+04:00 Steven Rostedt <rostedt@goodmis.org>:
> On Thu, 20 Mar 2014 17:02:39 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
>
>> Eventually the server should just allow the read to complete even if
>> the client doesn't respond to the oplock break. It has to since clients
>> can suddenly drop off the net while holding an oplock. That should
>> allow everything to unwedge eventually (though it may take a while).
>>
>> If that's not happening then I'd be curious as to why...
>
> The problem is that the data is being filled in the page and the reader
> is waiting for the page lock to be released. The kworker for the reader
> will issue the complete() and unlock the page to wake up the reader.
>
> But because the other workqueue callback calls down_read(), and there
> can be a down_write() waiting for the reader to finish, this
> down_read() will block on the lock as well (rwsems are fair locks).
> This blocks the other workqueue callback from issuing the complete and
> page_unlock() that will wake up the reader that is holding the rwsem
> with down_read().
>
> DEADLOCK.

Thank you for reporting and clarifying the issue!

Read and write codepaths both obtain lock_sem for read and then wait
for cifsiod_wq to complete and release lock_sem. They don't do any
lock_sem operations inside their work task queued to cifsiod_wq. But
the oplock code can obtain/release lock_sem in its work task. So that's
why I agree with Jeff and suggest moving the oplock code to a different
workqueue (cifsioopd_wq?) but leaving the read and write codepaths on
cifsiod_wq.
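[A hedged sketch of the variant Pavel and Jeff suggest; the queue name
and helper below are assumptions for illustration, not a tested patch.
Only the oplock-break work, which takes lock_sem, moves to a dedicated
queue, while the read and write completions, which merely wake lock_sem
holders, stay on cifsiod_wq:

	#include <linux/workqueue.h>

	extern struct workqueue_struct *cifsiod_wq;

	/* dedicated queue for the one work item that takes lock_sem */
	static struct workqueue_struct *cifsoplockd_wq;

	static int __init init_oplock_wq(void)
	{
		cifsoplockd_wq = alloc_workqueue("cifsoplockd",
						 WQ_FREEZABLE | WQ_MEM_RECLAIM,
						 0);
		return cifsoplockd_wq ? 0 : -ENOMEM;
	}

	static void queue_cifs_work(struct work_struct *read_work,
				    struct work_struct *write_work,
				    struct work_struct *oplock_break)
	{
		/* read/write completions stay on the existing queue ... */
		queue_work(cifsiod_wq, read_work);
		queue_work(cifsiod_wq, write_work);

		/* ... so the lock_sem-taking oplock break can never queue
		 * in front of the completion that releases lock_sem
		 * waiters */
		queue_work(cifsoplockd_wq, oplock_break);
	}
]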
On Fri, 21 Mar 2014 12:32:12 +0400
Pavel Shilovsky <piastry@etersoft.ru> wrote:

> 2014-03-21 6:23 GMT+04:00 Steven Rostedt <rostedt@goodmis.org>:
> > On Thu, 20 Mar 2014 17:02:39 -0400
> > Jeff Layton <jlayton@redhat.com> wrote:
> >
> >> Eventually the server should just allow the read to complete even if
> >> the client doesn't respond to the oplock break. It has to since clients
> >> can suddenly drop off the net while holding an oplock. That should
> >> allow everything to unwedge eventually (though it may take a while).
> >>
> >> If that's not happening then I'd be curious as to why...
> >
> > The problem is that the data is being filled in the page and the reader
> > is waiting for the page lock to be released. The kworker for the reader
> > will issue the complete() and unlock the page to wake up the reader.
> >
> > But because the other workqueue callback calls down_read(), and there
> > can be a down_write() waiting for the reader to finish, this
> > down_read() will block on the lock as well (rwsems are fair locks).
> > This blocks the other workqueue callback from issuing the complete and
> > page_unlock() that will wake up the reader that is holding the rwsem
> > with down_read().
> >
> > DEADLOCK.
>
> Thank you for reporting and clarifying the issue!
>
> Read and write codepaths both obtain lock_sem for read and then wait
> for cifsiod_wq to complete and release lock_sem. They don't do any
> lock_sem operations inside their work task queued to cifsiod_wq. But
> the oplock code can obtain/release lock_sem in its work task. So that's
> why I agree with Jeff and suggest moving the oplock code to a different
> workqueue (cifsioopd_wq?) but leaving the read and write codepaths on
> cifsiod_wq.

Yes, thanks for the clarification. I missed the fact that a 3rd task
was blocked on a down_write.

I think that fix will help prevent the full-blown deadlock, but it'll
still be susceptible to long stalls in some situations. In Steven's
example the work on CPU0 is blocked on a SMB_READ. That read may not be
completing because the server has issued an oplock break to the client
and is waiting on the response. That oplock break response is blocked
behind the down_write. In that situation, what breaks the deadlock is
that the server eventually gives up waiting on the oplock break
response, but that can take on the order of a minute.

So yeah, I think we should add a dedicated workqueue for oplock breaks
as an interim fix, but I also think we need to consider revamping the
lock pushing code such that oplock breaks are not subject to being
blocked like that.
On Fri, 21 Mar 2014 12:32:12 +0400
Pavel Shilovsky <piastry@etersoft.ru> wrote:

> Read and write codepaths both obtain lock_sem for read and then wait
> for cifsiod_wq to complete and release lock_sem. They don't do any
> lock_sem operations inside their work task queued to cifsiod_wq. But
> the oplock code can obtain/release lock_sem in its work task. So that's
> why I agree with Jeff and suggest moving the oplock code to a different
> workqueue (cifsioopd_wq?) but leaving the read and write codepaths on
> cifsiod_wq.

OK, how about I submit a second patch that moves the reader and writer
to its own "safe" workqueue?

-- Steve
On Fri, 21 Mar 2014 08:17:06 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 21 Mar 2014 12:32:12 +0400
> Pavel Shilovsky <piastry@etersoft.ru> wrote:
>
> > Read and write codepaths both obtain lock_sem for read and then wait
> > for cifsiod_wq to complete and release lock_sem. They don't do any
> > lock_sem operations inside their work task queued to cifsiod_wq. But
> > the oplock code can obtain/release lock_sem in its work task. So that's
> > why I agree with Jeff and suggest moving the oplock code to a different
> > workqueue (cifsioopd_wq?) but leaving the read and write codepaths on
> > cifsiod_wq.
>
> OK, how about I submit a second patch that moves the reader and writer
> to its own "safe" workqueue?
>
> -- Steve

That'd probably work fine too. The main point is to make sure oplock
breaks run on a different workqueue from where read or write completion
jobs run since they are operating on the lock_sem. The other jobs that
get queued to cifsiod_wq don't touch the lock_sem and shouldn't be a
problem.
On Fri, 21 Mar 2014 08:41:28 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> That'd probably work fine too. The main point is to make sure oplock
> breaks run on a different workqueue from where read or write completion
> jobs run since they are operating on the lock_sem. The other jobs that
> get queued to cifsiod_wq don't touch the lock_sem and shouldn't be a
> problem.

OK, I'll take a look at them, and maybe I'll just move the oplock
workqueue. I think you are correct, and it may be best to move the one
that takes the lock. Keep that one separate and that will keep the
others from being blocked by it.

Thanks, I'll write something up in a bit.

-- Steve
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..6656058 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -86,6 +86,7 @@ extern mempool_t *cifs_req_poolp;
 extern mempool_t *cifs_mid_poolp;
 
 struct workqueue_struct *cifsiod_wq;
+struct workqueue_struct *cifsiord_wq;
 
 #ifdef CONFIG_CIFS_SMB2
 __u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
@@ -1199,9 +1200,15 @@ init_cifs(void)
 		goto out_clean_proc;
 	}
 
+	cifsiord_wq = alloc_workqueue("cifsiord",
+				      WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+	if (!cifsiord_wq) {
+		rc = -ENOMEM;
+		goto out_destroy_wq;
+	}
+
 	rc = cifs_fscache_register();
 	if (rc)
-		goto out_destroy_wq;
+		goto out_destroy_rwq;
 
 	rc = cifs_init_inodecache();
 	if (rc)
@@ -1249,6 +1256,8 @@ out_destroy_inodecache:
 	cifs_destroy_inodecache();
 out_unreg_fscache:
 	cifs_fscache_unregister();
+out_destroy_rwq:
+	destroy_workqueue(cifsiord_wq);
 out_destroy_wq:
 	destroy_workqueue(cifsiod_wq);
 out_clean_proc:
@@ -1273,6 +1282,7 @@ exit_cifs(void)
 	cifs_destroy_inodecache();
 	cifs_fscache_unregister();
 	destroy_workqueue(cifsiod_wq);
+	destroy_workqueue(cifsiord_wq);
 	cifs_proc_clean();
 }
 
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index c0f3718..75d1941 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct *work);
 extern const struct slow_work_ops cifs_oplock_break_ops;
 
 extern struct workqueue_struct *cifsiod_wq;
+extern struct workqueue_struct *cifsiord_wq;
 
 extern mempool_t *cifs_mid_poolp;
 
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index f3264bd..ca04a2e 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1571,7 +1571,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
 		rdata->result = -EIO;
 	}
 
-	queue_work(cifsiod_wq, &rdata->work);
+	queue_work(cifsiord_wq, &rdata->work);
 	DeleteMidQEntry(mid);
 	add_credits(server, 1, 0);
 }
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 8603447..b74bf61 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -1742,7 +1742,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
 	if (rdata->result)
 		cifs_stats_fail_inc(tcon, SMB2_READ_HE);
 
-	queue_work(cifsiod_wq, &rdata->work);
+	queue_work(cifsiord_wq, &rdata->work);
 	DeleteMidQEntry(mid);
 	add_credits(server, credits_received, 0);
 }
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.

It is much easier to deadlock in the -rt kernel because reader locks
are serialized, where a down_read() can block another down_read(). But
because rwsems are fair locks, if a writer is waiting, a new reader
will then block. This means that if it is possible for a reader to
deadlock another reader, this can happen if a writer comes along and
blocks on a current reader. That will prevent another reader from
running, and if that new reader is needed to wake up the reader that
owns the lock, you have your deadlock.

Here's the situation with CIFS and workqueues:

The cifs system has several workqueues used in file.c and other places.
One of them is used for completion of a read and to release the
page_lock which wakes up the reader. There are several other workqueues
that do various other tasks.

A holder of the reader lock can sleep on a page_lock() and expect the
reader workqueue to wake it up (page_unlock()). The reader workqueue
takes no locks so this does not seem to be a problem (but it is).

The other workqueues can take the rwsem for read or for write. But the
issue we tripped over was that it grabs it for read (remember in -rt
readers are serialized). But this can also happen if a separate writer
is waiting on the lock as that would cause a reader to block on another
reader too.

All the workqueue callbacks are executed on the same workqueue:

	queue_work(cifsiod_wq, &rdata->work);
	[...]
	queue_work(cifsiod_wq, &cfile->oplock_break);

Now if the reader workqueue callback is queued after one of these
workqueues that can take the rwsem, we can hit a deadlock. The
workqueue code looks to be able to prevent deadlocks of these kinds,
but I do not totally understand the workqueue scheduled work structure
and perhaps if the kworker thread structure blocks hard it won't move
works around.

Here's what we see:

rdata->work is scheduled after cfile->oplock_break

	CPU0				CPU1
	----				----

 do_sync_read()
   cifs_strict_readv()
     down_read(cinode->lock_sem);
     generic_file_aio_read()
       __lock_page_killable()
         __wait_on_bit_lock()

    * BLOCKED *

				process_one_work()
				  cifs_oplock_break()
				    cifs_has_mand_locks()
				      down_read(cinode->lock_sem);

				   * BLOCKED *

	[ note, cifs_oplock_break() can also call cifs_push_locks which
	  takes the lock with down_write() ]

But we noticed that the rdata->work was queued to run under the same
workqueue task and this work is to wake up the owner of the semaphore.
But because the workqueue task is blocked waiting on that lock, it
will never wake it up.

My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?

This is only compile tested; we have not given it to our customer yet.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>