Message ID | 20140731131331.GT19379@twins.programming.kicks-ass.net
---|---
State | New, archived
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jul 31, 2014 at 04:37:29PM +0400, Ilya Dryomov wrote:
>
> > This didn't make sense to me at first too, and I'll be happy to be
> > proven wrong, but we can reproduce this with rbd very reliably under
> > higher than usual load, and the revert makes it go away.  What we are
> > seeing in the rbd scenario is the following.
>
> This is drivers/block/rbd.c ?  I can find but a single mutex_lock() in
> there.
>
> > Suppose foo needs mutexes A and B, bar needs mutex B.  foo acquires
> > A and then wants to acquire B, but B is held by bar.  foo spins
> > a little and ends up calling schedule_preempt_disabled() on line 484
> > above, but that call never returns, even though a hundred usecs later
> > bar releases B.  foo ends up stuck in mutex_lock() indefinitely, but
> > still holds A and everybody else who needs A gets behind A.  Given that
> > this A happens to be a central libceph mutex, all rbd activity halts.
> > Deadlock may not be the best term for this, but never returning from
> > mutex_lock(&B) even though B has been unlocked is *a* problem.
> >
> > This obviously doesn't happen every time schedule_preempt_disabled() on
> > line 484 is called, so there must be some sort of race here.  I'll send
> > along the actual rbd stack traces shortly.
>
> Smells like maybe current->state != TASK_RUNNING, does the below
> trigger?
>
> If so, you've wrecked something in whatever...
>
> ---
>  kernel/locking/mutex.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index ae712b25e492..3d726fdaa764 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -473,8 +473,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  	 * reschedule now, before we try-lock the mutex. This avoids getting
>  	 * scheduled out right after we obtained the mutex.
>  	 */
> -	if (need_resched())
> +	if (need_resched()) {
> +		if (WARN_ON_ONCE(current->state != TASK_RUNNING))
> +			__set_current_state(TASK_RUNNING);
> +
>  		schedule_preempt_disabled();
> +	}

Might make sense to add that debug check under mutex debugging or so,
with a sensible kernel message printed.

Thanks,

	Ingo
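As a rough illustration of Ingo's suggestion, here is a minimal sketch of
what such a check might look like when gated behind mutex debugging. The
choice of CONFIG_DEBUG_MUTEXES as the gate and the message wording are
assumptions, not part of the thread:

#ifdef CONFIG_DEBUG_MUTEXES
	/*
	 * Hypothetical variant of the check above: only built with mutex
	 * debugging enabled, and with a more descriptive message.  WARN_ONCE()
	 * returns the (truthy) condition, so we also repair the state.
	 */
	if (WARN_ONCE(current->state != TASK_RUNNING,
		      "mutex: lock attempt with task state %ld, expect lost wakeups\n",
		      current->state))
		__set_current_state(TASK_RUNNING);
#endif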
On Thu, 2014-07-31 at 15:13 +0200, Peter Zijlstra wrote:
> Smells like maybe current->state != TASK_RUNNING
Bingo
[ 1200.851004] kjournald D 0000000000000002 0 4398 2 0x00000000
[ 1200.858283] ffff8803537bb788 0000000000000046 ffff8803537bb7a8 0000000000000000
[ 1200.865914] ffff880423324b60 0000000000012f80 ffff8803537bbfd8 0000000000012f80
[ 1200.873590] ffff88042937cb60 ffff880423324b60 ffff880428ceb240 ffff8804231e59b8
[ 1200.881256] Call Trace:
[ 1200.883724] [<ffffffff816981c9>] schedule+0x29/0x70
[ 1200.888798] [<ffffffff8169849e>] schedule_preempt_disabled+0xe/0x10
[ 1200.895239] [<ffffffff81699fe5>] __mutex_lock_slowpath+0x1b5/0x1c0
[ 1200.901673] [<ffffffffa0479826>] ? ceph_str_hash+0x26/0x80 [libceph]
[ 1200.908198] [<ffffffff8169a013>] mutex_lock+0x23/0x37
[ 1200.913430] [<ffffffffa046751d>] ceph_con_send+0x4d/0x130 [libceph]
[ 1200.919912] [<ffffffffa046c540>] __send_queued+0x120/0x150 [libceph]
[ 1200.926444] [<ffffffffa046ec5b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
[ 1200.933855] [<ffffffffa046ed21>] ceph_osdc_start_request+0x51/0x80 [libceph]
[ 1200.941126] [<ffffffffa042bf60>] rbd_obj_request_submit.isra.25+0x10/0x20 [rbd]
[ 1200.948622] [<ffffffffa042e8ee>] rbd_img_obj_request_submit+0x1ce/0x460 [rbd]
[ 1200.956040] [<ffffffffa042ebcc>] rbd_img_request_submit+0x4c/0x60 [rbd]
[ 1200.962845] [<ffffffffa042f2a8>] rbd_request_fn+0x238/0x290 [rbd]
[ 1200.969108] [<ffffffff8133a397>] __blk_run_queue+0x37/0x50
[ 1200.974764] [<ffffffff8133affd>] queue_unplugged+0x3d/0xc0
[ 1200.980424] [<ffffffff8133fddb>] blk_flush_plug_list+0x1db/0x210
[ 1200.986635] [<ffffffff81698288>] io_schedule+0x78/0xd0
[ 1200.991954] [<ffffffff8133b864>] get_request+0x414/0x800
[ 1200.997440] [<ffffffff8133f477>] ? bio_attempt_back_merge+0x37/0x100
[ 1201.004013] [<ffffffff8109b9e0>] ? __wake_up_sync+0x20/0x20
[ 1201.009782] [<ffffffff8133ff2c>] blk_queue_bio+0xcc/0x360
[ 1201.015353] [<ffffffff8133c2d0>] generic_make_request+0xc0/0x100
[ 1201.021605] [<ffffffff8133c385>] submit_bio+0x75/0x140
[ 1201.026921] [<ffffffff811de4e6>] _submit_bh+0x136/0x1f0
[ 1201.032390] [<ffffffff81290081>] journal_do_submit_data+0x41/0x50
[ 1201.038662] [<ffffffff81291380>] journal_commit_transaction+0x1150/0x1350
[ 1201.045683] [<ffffffff81063aff>] ? try_to_del_timer_sync+0x4f/0x70
[ 1201.052043] [<ffffffff81293e01>] kjournald+0xe1/0x260
[ 1201.057324] [<ffffffff8109b9e0>] ? __wake_up_sync+0x20/0x20
[ 1201.063072] [<ffffffff81293d20>] ? commit_timeout+0x10/0x10
[ 1201.068855] [<ffffffff81078829>] kthread+0xc9/0xe0
[ 1201.073819] [<ffffffff81078760>] ? flush_kthread_worker+0xb0/0xb0
[ 1201.080084] [<ffffffff8169bb6c>] ret_from_fork+0x7c/0xb0
[ 1201.085573] [<ffffffff81078760>] ? flush_kthread_worker+0xb0/0xb0
On Thu, Jul 31, 2014 at 6:30 PM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Thu, 2014-07-31 at 15:13 +0200, Peter Zijlstra wrote:
>
>> Smells like maybe current->state != TASK_RUNNING

It just triggered for me too, took longer than usual.  Sorry for the
churn Peter, this was really confusing.  Onto finding the real bug..

Thanks,

		Ilya
On Thu, Jul 31, 2014 at 04:30:52PM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-31 at 15:13 +0200, Peter Zijlstra wrote:
>
> > Smells like maybe current->state != TASK_RUNNING
>
> Bingo
>
> [ 1200.851004] kjournald       D 0000000000000002     0  4398      2 0x00000000
> [ stack trace snipped; quoted in full above ]

Ohh. that's properly broken indeed.

You can't just call blocking primitives on the way to schedule(), that's
fail.

Also, if I look at blk_flush_plug_list(), it calls queue_unplugged()
with IRQs disabled, so _who_ is enabling them again and calling blocking
stuff?

/me stares more..

rbd_request_fn() does.. *argh*

Someone needs to go fix, this cannot work right.
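To make the pattern Peter is pointing at concrete, here is a hypothetical
reduction of such a ->request_fn(). The names broken_request_fn and
driver_mutex are made up; this is not the actual rbd code:

static DEFINE_MUTEX(driver_mutex);	/* hypothetical driver-wide lock */

static void broken_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = blk_fetch_request(q))) {
		/*
		 * Re-enables IRQs even though blk_flush_plug_list() may
		 * have gotten here under local_irq_save() ...
		 */
		spin_unlock_irq(q->queue_lock);

		/*
		 * ... and may sleep, possibly on behalf of a task that is
		 * already on its way to schedule() with
		 * ->state != TASK_RUNNING, as in the kjournald trace above.
		 */
		mutex_lock(&driver_mutex);
		mutex_unlock(&driver_mutex);

		spin_lock_irq(q->queue_lock);
	}
}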
On Thu, Jul 31, 2014 at 6:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jul 31, 2014 at 04:30:52PM +0200, Mike Galbraith wrote:
>> On Thu, 2014-07-31 at 15:13 +0200, Peter Zijlstra wrote:
>>
>> > Smells like maybe current->state != TASK_RUNNING
>>
>> Bingo
>>
>> [ 1200.851004] kjournald       D 0000000000000002     0  4398      2 0x00000000
>> [ stack trace snipped; quoted in full above ]
>
> Ohh. that's properly broken indeed.
>
> You can't just call blocking primitives on the way to schedule(), that's
> fail.
>
> Also, if I look at blk_flush_plug_list(), it calls queue_unplugged()
> with IRQs disabled, so _who_ is enabling them again and calling blocking
> stuff?
>
> /me stares more..
>
> rbd_request_fn() does.. *argh*
>
> Someone needs to go fix, this cannot work right.

I'm going to fix up rbd_request_fn(), but I want to make sure
I understand this in full.

- Previously the danger of calling blocking primitives on the way to
  schedule(), i.e. with task->state != TASK_RUNNING, was that if the
  blocking primitive was indeed to block, the task state would be set
  back to TASK_RUNNING and the schedule() that the task was on its way
  to wouldn't have any effect.  Your "Add extra reschedule point" patch
  essentially made calling mutex_lock() and probably others much more
  wrong than it used to be, because mutex_lock() may now reschedule
  when the task is not on the mutex wait queue.

- There is nothing wrong with releasing queue_lock and reenabling IRQs
  in rbd_request_fn() as long as it doesn't block and I remember to
  disable IRQs and take queue_lock back on return.

I'm asking because rbd_request_fn() is probably not the only code path
broken in this way.  I poked around and found read_events() in aio.c;
it seems to have been written with the "danger" assumption that
I outlined above, and there is even a comment to that effect.

Does the above make sense, or am I missing something?

Thanks,

		Ilya
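As an illustration of Ilya's first point, here is a self-contained sketch
of the hazard. All identifiers (demo_wq, demo_lock, demo_cond,
demo_broken_wait) are hypothetical:

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static DEFINE_MUTEX(demo_lock);
static bool demo_cond;

static void demo_broken_wait(void)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&demo_wq, &wait, TASK_UNINTERRUPTIBLE);

	/*
	 * BROKEN: blocking primitive on the way to schedule().  If this
	 * sleeps, it resets us to TASK_RUNNING, turning the schedule()
	 * below into a no-op (the "old" danger).  After the extra
	 * reschedule point, it may instead call
	 * schedule_preempt_disabled() while we are TASK_UNINTERRUPTIBLE
	 * and not on the mutex wait list, so nobody will ever wake us.
	 */
	mutex_lock(&demo_lock);
	mutex_unlock(&demo_lock);

	if (!demo_cond)
		schedule();

	finish_wait(&demo_wq, &wait);
}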
On Fri, Aug 01, 2014 at 04:56:27PM +0400, Ilya Dryomov wrote:
> I'm going to fix up rbd_request_fn(), but I want to make sure
> I understand this in full.
>
> - Previously the danger of calling blocking primitives on the way to
>   schedule(), i.e. with task->state != TASK_RUNNING, was that if the
>   blocking primitive was indeed to block, the task state would be set
>   back to TASK_RUNNING and the schedule() that the task was on its way
>   to wouldn't have any effect.  Your "Add extra reschedule point" patch
>   essentially made calling mutex_lock() and probably others much more
>   wrong than it used to be, because mutex_lock() may now reschedule
>   when the task is not on the mutex wait queue.

Right, and in general we cannot allow spurious wakeups (although we try
very hard to deal with them in generic code, which is why things more or
less worked).

But if you do a patch that 'randomly' ignores ->state on schedule (I did
one) stuff comes apart _real_ quick.

Therefore you should very much not destroy ->state on the way to
schedule.

> - There is nothing wrong with releasing queue_lock and reenabling IRQs
>   in rbd_request_fn() as long as it doesn't block and I remember to
>   disable IRQs and take queue_lock back on return.

Releasing queue_lock might be ok, dunno about the blk locking, however
reenabling IRQs is actually wrong as per blk_flush_plug_list(), since
that uses local_irq_save()/restore(), which means it can be called from
contexts which cannot deal with enabling IRQs, and then your
->request_fn() goes and does that.

Now maybe blk_flush_plug_list() is overly paranoid and it could use
local_irq_disable()/enable() instead, I don't know.  But until it does,
a request_fn() should never reenable IRQs.

> I'm asking because rbd_request_fn() is probably not the only code path
> broken in this way.  I poked around and found read_events() in aio.c;
> it seems to have been written with the "danger" assumption that
> I outlined above, and there is even a comment to that effect.

I'm fairly sure there's more broken stuff, I didn't dare looking.

> Does the above make sense, or am I missing something?

I think that's about it.
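One way out of the constraint Peter describes, sketched under assumptions
(the myblk_* names are hypothetical, and this is not necessarily how rbd
ended up being fixed), is to keep ->request_fn() strictly non-blocking
and push the blocking submission into a workqueue, where sleeping and
enabling IRQs are both fine:

struct myblk_dev {				/* hypothetical device */
	struct request_queue	*q;
	struct work_struct	rq_work;
};

static struct workqueue_struct *myblk_wq;	/* hypothetical workqueue */

static void myblk_submit(struct request *rq);	/* hypothetical: the part
						 * that sleeps and takes
						 * mutexes */

static void myblk_request_workfn(struct work_struct *work)
{
	struct myblk_dev *dev = container_of(work, struct myblk_dev, rq_work);
	struct request *rq;

	spin_lock_irq(dev->q->queue_lock);
	while ((rq = blk_fetch_request(dev->q))) {
		spin_unlock_irq(dev->q->queue_lock);
		/* process context: blocking is safe here */
		myblk_submit(rq);
		spin_lock_irq(dev->q->queue_lock);
	}
	spin_unlock_irq(dev->q->queue_lock);
}

static void myblk_request_fn(struct request_queue *q)
{
	struct myblk_dev *dev = q->queuedata;

	/*
	 * queue_lock is held, and IRQs may have been disabled with
	 * local_irq_save() by blk_flush_plug_list(): neither sleep nor
	 * reenable IRQs here.  queue_work() is safe in this context.
	 */
	queue_work(myblk_wq, &dev->rq_work);
}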
On Fri, Aug 1, 2014 at 5:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Aug 01, 2014 at 04:56:27PM +0400, Ilya Dryomov wrote:
>> Does the above make sense, or am I missing something?
>
> [ full reply snipped; quoted above ]
>
> I think that's about it.

Thanks for clarifying things.  CC'ing Kent to draw attention to
read_events().

Ilya
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index ae712b25e492..3d726fdaa764 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -473,8 +473,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 * reschedule now, before we try-lock the mutex. This avoids getting
 	 * scheduled out right after we obtained the mutex.
 	 */
-	if (need_resched())
+	if (need_resched()) {
+		if (WARN_ON_ONCE(current->state != TASK_RUNNING))
+			__set_current_state(TASK_RUNNING);
+
 		schedule_preempt_disabled();
+	}
 #endif
 
 	spin_lock_mutex(&lock->wait_lock, flags);