Message ID: 20180407102148.GA9729@gmail.com (mailing list archive)
State: New, archived
(cc'ing Joseph as he worked on the area recently, hi!)

Hello,

On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> The q->id is used as an index within the blkg_tree radix tree.
>
> If the entry is not released before the blk_queue_ida id is reclaimed,
> blkcg_init_queue() in a different driver that is handed the same id
> can fail, because the entry at that index in the radix tree already
> exists.
>
> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
> ---
> v2: Added no-op for !CONFIG_BLK_CGROUP
>
>  block/blk-cgroup.c         | 2 +-
>  block/blk-sysfs.c          | 4 ++++
>  include/linux/blk-cgroup.h | 3 +++
>  3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 1c16694ae145..224e937dbb59 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>   *
>   * Destroy all blkgs associated with @q.
>   */
> -static void blkg_destroy_all(struct request_queue *q)
> +void blkg_destroy_all(struct request_queue *q)
>  {
>  	struct blkcg_gq *blkg, *n;
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index d00d1b0ec109..a72866458f22 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
>  	if (q->bio_split)
>  		bioset_free(q->bio_split);
>
> +	spin_lock_irq(q->queue_lock);
> +	blkg_destroy_all(q);
> +	spin_unlock_irq(q->queue_lock);
> +
>  	ida_simple_remove(&blk_queue_ida, q->id);
>  	call_rcu(&q->rcu_head, blk_free_queue_rcu);

But we already do this through calling blkcg_exit_queue() from
__blk_release_queue().  What's missing?

Thanks.
On Mon, Apr 09, 2018 at 03:09:38PM -0700, Tejun Heo wrote:
> (cc'ing Joseph as he worked on the area recently, hi!)
>
> Hello,
>
> On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> > The q->id is used as an index within the blkg_tree radix tree.
> >
> > If the entry is not released before the blk_queue_ida id is reclaimed,
> > blkcg_init_queue() in a different driver that is handed the same id
> > can fail, because the entry at that index in the radix tree already
> > exists.
> >
> > Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
> > ---
> > v2: Added no-op for !CONFIG_BLK_CGROUP
> >
> > [...]
>
> But we already do this through calling blkcg_exit_queue() from
> __blk_release_queue().  What's missing?

Hi,

It might be the jetlag but I can't see how you end up calling
blkcg_exit_queue() from __blk_release_queue().

As I see it the only way to reach blkcg_exit_queue() is from
blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().

I suspect that I'm just fixing a corner case though and
the general case is what you describe or similar.

../Alex

> Thanks.
>
> --
> tejun
Hello,

On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > But we already do this through calling blkcg_exit_queue() from
> > __blk_release_queue().  What's missing?
>
> Hi,
>
> It might be the jetlag but I can't see how you end up calling
> blkcg_exit_queue() from __blk_release_queue().
>
> As I see it the only way to reach blkcg_exit_queue() is from
> blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
>
> I suspect that I'm just fixing a corner case though and
> the general case is what you describe or similar.

Ah, that changed recently.  Can you please check out the current
upstream git master?

Thanks.
On Wed, Apr 11, 2018 at 07:20:19AM -0700, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > > But we already do this through calling blkcg_exit_queue() from
> > > __blk_release_queue().  What's missing?
> >
> > Hi,
> >
> > It might be the jetlag but I can't see how you end up calling
> > blkcg_exit_queue() from __blk_release_queue().
> >
> > As I see it the only way to reach blkcg_exit_queue() is from
> > blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> >
> > I suspect that I'm just fixing a corner case though and
> > the general case is what you describe or similar.
>
> Ah, that changed recently.  Can you please check out the current
> upstream git master?
>
> Thanks.

Just did, without my patch I see this crash:

[    0.759999] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.16.0-ARCH+ #81 [7/1949]
[    0.759999] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/204
[    0.759999] RIP: 0010:pi_init+0x23f/0x2f0
[    0.759999] RSP: 0000:ffffc90000197d90 EFLAGS: 00010246
[    0.759999] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000038
[    0.759999] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
[    0.759999] RBP: ffffc90000197e18 R08: 00000000ffffffff R09: 00000000ffffffff
[    0.759999] R10: ffffea0000eda600 R11: ffff88003b69f164 R12: ffffffff82e2d740
[    0.759999] R13: 00000000ffffffff R14: 0000000000000000 R15: 0000000000000000
[    0.759999] FS:  0000000000000000(0000) GS:ffff88003e500000(0000) knlGS:0000000000000000
[    0.759999] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.759999] CR2: 0000000000000008 CR3: 0000000002209001 CR4: 00000000000606e0
[    0.759999] Call Trace:
[    0.759999]  pf_init+0x1db/0x3be
[    0.759999]  ? pcd_init+0x4e8/0x4e8
[    0.759999]  do_one_initcall+0x9e/0x1b0
[    0.759999]  ? do_early_param+0x97/0x97
[    0.759999]  kernel_init_freeable+0x259/0x2fd
[    0.759999]  ? rest_init+0xd0/0xd0
[    0.759999]  ? syscall_slow_exit_work+0x1c/0x160
[    0.759999]  kernel_init+0xe/0x100
[    0.759999]  ret_from_fork+0x3a/0x50
[    0.759999] Code: 75 6a 49 8b 06 48 8b 40 78 48 85 c0 74 08 4c 89 f7 e8 46 76 51 00 83 c3 01 3b 5d a8 7d 0d 49
[    0.759999] RIP: pi_init+0x23f/0x2f0 RSP: ffffc90000197d90
[    0.759999] CR2: 0000000000000008
[    0.759999] ---[ end trace 12004f267bb8bf7d ]---
[    0.766666] BUG: unable to handle kernel NULL pointer dereference at 00000000000001b4
[    0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[    0.763350]
[    0.766666] PGD 0 P4D 0
[    0.766666] Oops: 0000 [#2] PREEMPT SMP
[    0.766666] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G      D          4.16.0-ARCH+ #81
[    0.766666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
[    0.766666] Workqueue: nvme-reset-wq nvme_reset_work
[    0.766666] RIP: 0010:blk_queue_flag_set+0xf/0x40
[    0.766666] RSP: 0000:ffffc900001bfcb0 EFLAGS: 00010246
[    0.766666] RAX: ffff88003b698000 RBX: 0000000000000000 RCX: 0000000000000000
[    0.766666] RDX: ffff88003b698000 RSI: fffffffffffffff4 RDI: 000000000000001c
[    0.766666] RBP: ffffc900001bfcc0 R08: 0000000000000000 R09: 0000000000000000
[    0.766666] R10: ffffea0000eaa980 R11: ffffffff814e0970 R12: 000000000000001c
[    0.766666] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88003aad8010
[    0.766666] FS:  0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
[    0.766666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.766666] CR2: 00000000000001b4 CR3: 0000000002209001 CR4: 00000000000606f0
[    0.766666] Call Trace:
[    0.766666]  blk_mq_quiesce_queue+0x23/0x80
[    0.766666]  nvme_dev_disable+0x34f/0x480
[    0.766666]  ? nvme_irq+0x50/0x50
[    0.766666]  ? dev_warn+0x64/0x80
[    0.766666]  nvme_reset_work+0x13de/0x1570
[    0.766666]  ? __switch_to_asm+0x34/0x70
[    0.766666]  ? __switch_to_asm+0x40/0x70
[    0.766666]  ? _raw_spin_unlock_irq+0x15/0x30
[    0.766666]  ? finish_task_switch+0x156/0x210
[    0.766666]  process_one_work+0x20c/0x3d0
[    0.766666]  worker_thread+0x216/0x400
[    0.766666]  kthread+0x125/0x130
[    0.766666]  ? process_one_work+0x3d0/0x3d0
[    0.766666]  ? __kthread_bind_mask+0x60/0x60
[    0.766666]  ret_from_fork+0x3a/0x50

With the patch the crash goes away,

Thanks,
../Alex

> --
> tejun
Hello,

On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > Ah, that changed recently.  Can you please check out the current
> > upstream git master?
> >
> Just did, without my patch I see this crash:

lol I was looking at the old tree, so this is the fix for the new
breakage introduced by the recent change.  Sorry about the confusion.
Joseph, can you please take a look?

Thanks.
Hello, (cc'ing Bart)

On Wed, Apr 11, 2018 at 07:46:16AM -0700, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > > Ah, that changed recently.  Can you please check out the current
> > > upstream git master?
> > >
> > Just did, without my patch I see this crash:
>
> lol I was looking at the old tree, so this is the fix for the new
> breakage introduced by the recent change.  Sorry about the confusion.
> Joseph, can you please take a look?

Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
request queue removal and the block cgroup controller") created a
regression where a request_queue can be destroyed with blkgs still
attached.  The original report is..

  http://lkml.kernel.org/r/20180407102148.GA9729@gmail.com

Thanks.
Hello, again.

On Wed, Apr 11, 2018 at 07:51:23AM -0700, Tejun Heo wrote:
> Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
> reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
> request queue removal and the block cgroup controller") created a
> regression where a request_queue can be destroyed with blkgs still
> attached.  The original report is..
>
>   http://lkml.kernel.org/r/20180407102148.GA9729@gmail.com

And looking at the change, it looks like the right thing we should
have done is caching @lock on the print_blkg side and when switching
locks make sure both locks are held.  IOW, do the following in
blk_cleanup_queue()

	spin_lock_irq(lock);
	if (q->queue_lock != &q->__queue_lock) {
		spin_lock(&q->__queue_lock);
		q->queue_lock = &q->__queue_lock;
		spin_unlock(&q->__queue_lock);
	}
	spin_unlock_irq(lock);

Otherwise, there can be two lock holders thinking they have exclusive
access to the request_queue.

Thanks.
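Spelled out, the interleaving the snippet above guards against looks
something like the following; this timeline is an illustration inferred
from the discussion, not taken from an actual trace:

	/*
	 * blkg user                         blk_cleanup_queue()
	 * ---------------------------       ----------------------------------
	 * lock = q->queue_lock;
	 *     (sees the driver's lock)
	 *                                    q->queue_lock = &q->__queue_lock;
	 * spin_lock_irq(lock);               spin_lock_irq(q->queue_lock);
	 *     (holds the driver's lock)          (holds q->__queue_lock)
	 *
	 * Both paths now believe they hold "the" queue lock, so neither
	 * excludes the other from the request_queue.  Taking the old lock
	 * across the pointer switch closes this window.
	 */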
On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> [    0.766666] BUG: unable to handle kernel NULL pointer dereference at 00000000000001b4
> [    0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> [    0.763350]
> [    0.766666] PGD 0 P4D 0
> [    0.766666] Oops: 0000 [#2] PREEMPT SMP
> [    0.766666] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G      D          4.16.0-ARCH+ #81
> [    0.766666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
> [    0.766666] Workqueue: nvme-reset-wq nvme_reset_work
> [    0.766666] RIP: 0010:blk_queue_flag_set+0xf/0x40
> [    0.766666] RSP: 0000:ffffc900001bfcb0 EFLAGS: 00010246
> [    0.766666] RAX: ffff88003b698000 RBX: 0000000000000000 RCX: 0000000000000000
> [    0.766666] RDX: ffff88003b698000 RSI: fffffffffffffff4 RDI: 000000000000001c
> [    0.766666] RBP: ffffc900001bfcc0 R08: 0000000000000000 R09: 0000000000000000
> [    0.766666] R10: ffffea0000eaa980 R11: ffffffff814e0970 R12: 000000000000001c
> [    0.766666] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88003aad8010
> [    0.766666] FS:  0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
> [    0.766666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.766666] CR2: 00000000000001b4 CR3: 0000000002209001 CR4: 00000000000606f0
> [    0.766666] Call Trace:
> [    0.766666]  blk_mq_quiesce_queue+0x23/0x80
> [    0.766666]  nvme_dev_disable+0x34f/0x480
> [    0.766666]  ? nvme_irq+0x50/0x50
> [    0.766666]  ? dev_warn+0x64/0x80
> [    0.766666]  nvme_reset_work+0x13de/0x1570
> [    0.766666]  ? __switch_to_asm+0x34/0x70
> [    0.766666]  ? __switch_to_asm+0x40/0x70
> [    0.766666]  ? _raw_spin_unlock_irq+0x15/0x30
> [    0.766666]  ? finish_task_switch+0x156/0x210
> [    0.766666]  process_one_work+0x20c/0x3d0
> [    0.766666]  worker_thread+0x216/0x400
> [    0.766666]  kthread+0x125/0x130
> [    0.766666]  ? process_one_work+0x3d0/0x3d0
> [    0.766666]  ? __kthread_bind_mask+0x60/0x60
> [    0.766666]  ret_from_fork+0x3a/0x50

Hello Alexandru,

What made you look at cgroups?  In the above register dump I see that
%rbx == NULL.  I think that means that the queue pointer argument of
blk_queue_flag_set() is NULL.  The NVMe initiator driver should never pass
a NULL pointer to blk_mq_quiesce_queue().  Please ask the NVMe driver
maintainers for their opinion on the linux-nvme mailing list.

Thanks,

Bart.
On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> And looking at the change, it looks like the right thing we should
> have done is caching @lock on the print_blkg side and when switching
> locks make sure both locks are held.  IOW, do the following in
> blk_cleanup_queue()
>
> 	spin_lock_irq(lock);
> 	if (q->queue_lock != &q->__queue_lock) {
> 		spin_lock(&q->__queue_lock);
> 		q->queue_lock = &q->__queue_lock;
> 		spin_unlock(&q->__queue_lock);
> 	}
> 	spin_unlock_irq(lock);
>
> Otherwise, there can be two lock holders thinking they have exclusive
> access to the request_queue.

I think that's a bad idea.  A block driver is allowed to destroy the
spinlock it associated with the request queue as soon as blk_cleanup_queue()
has finished.  If the block cgroup controller would cache a pointer to the
block driver spinlock then that could cause the cgroup code to attempt to
lock a spinlock after it has been destroyed.  I don't think we need that
kind of race conditions.

Bart.
Hello,

On Wed, Apr 11, 2018 at 04:42:55PM +0000, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > And looking at the change, it looks like the right thing we should
> > have done is caching @lock on the print_blkg side and when switching
> > locks make sure both locks are held.  IOW, do the following in
> > blk_cleanup_queue()
> >
> > 	spin_lock_irq(lock);
> > 	if (q->queue_lock != &q->__queue_lock) {
> > 		spin_lock(&q->__queue_lock);
> > 		q->queue_lock = &q->__queue_lock;
> > 		spin_unlock(&q->__queue_lock);
> > 	}
> > 	spin_unlock_irq(lock);
> >
> > Otherwise, there can be two lock holders thinking they have exclusive
> > access to the request_queue.
>
> I think that's a bad idea.  A block driver is allowed to destroy the
> spinlock it associated with the request queue as soon as blk_cleanup_queue()
> has finished.  If the block cgroup controller would cache a pointer to the
> block driver spinlock then that could cause the cgroup code to attempt to
> lock a spinlock after it has been destroyed.  I don't think we need that
> kind of race conditions.

I see, but that problem is there with or without caching as long as we
have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
that user caches the lock for matching unlocking or not doesn't really
change the situation.

Short of adding protection around queue_lock switching, I can't think
of a solution tho.  Probably the right thing to do is adding queue
lock/unlock helpers which are safe to use beyond cleanup_queue.

Thanks.
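A minimal sketch of what such helpers might look like; the names are
hypothetical (not an existing kernel API), and note that this alone
inherits the limitation Bart raises, since the cached pointer can still
outlive a driver-owned lock unless the switch itself is protected:

	/* Hypothetical helpers, sketched for illustration only. */
	static inline spinlock_t *blk_queue_lock_irq(struct request_queue *q)
	{
		spinlock_t *lock = q->queue_lock;	/* cache what we lock */

		spin_lock_irq(lock);
		return lock;
	}

	static inline void blk_queue_unlock_irq(spinlock_t *lock)
	{
		spin_unlock_irq(lock);	/* always pairs with the cached lock */
	}

	/* usage: the unlock matches the lock actually taken, even if
	 * q->queue_lock is switched back to &q->__queue_lock in between */
	spinlock_t *lock = blk_queue_lock_irq(q);
	blkg_destroy_all(q);
	blk_queue_unlock_irq(lock);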
On Wed, 2018-04-11 at 10:00 -0700, tj@kernel.org wrote:
> On Wed, Apr 11, 2018 at 04:42:55PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > > And looking at the change, it looks like the right thing we should
> > > have done is caching @lock on the print_blkg side and when switching
> > > locks make sure both locks are held.  IOW, do the following in
> > > blk_cleanup_queue()
> > >
> > > 	spin_lock_irq(lock);
> > > 	if (q->queue_lock != &q->__queue_lock) {
> > > 		spin_lock(&q->__queue_lock);
> > > 		q->queue_lock = &q->__queue_lock;
> > > 		spin_unlock(&q->__queue_lock);
> > > 	}
> > > 	spin_unlock_irq(lock);
> > >
> > > Otherwise, there can be two lock holders thinking they have exclusive
> > > access to the request_queue.
> >
> > I think that's a bad idea.  A block driver is allowed to destroy the
> > spinlock it associated with the request queue as soon as blk_cleanup_queue()
> > has finished.  If the block cgroup controller would cache a pointer to the
> > block driver spinlock then that could cause the cgroup code to attempt to
> > lock a spinlock after it has been destroyed.  I don't think we need that
> > kind of race conditions.
>
> I see, but that problem is there with or without caching as long as we
> have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
> that user caches the lock for matching unlocking or not doesn't really
> change the situation.
>
> Short of adding protection around queue_lock switching, I can't think
> of a solution tho.  Probably the right thing to do is adding queue
> lock/unlock helpers which are safe to use beyond cleanup_queue.

Hello Tejun,

A simple and effective solution is to dissociate a request queue from the
block cgroup controller before blk_cleanup_queue() returns.  This is why
commit a063057d7c73 ("block: Fix a race between request queue removal and
the block cgroup controller") moved the blkcg_exit_queue() call from
__blk_release_queue() into blk_cleanup_queue().

Thanks,

Bart.
Hello,

On Wed, Apr 11, 2018 at 05:06:41PM +0000, Bart Van Assche wrote:
> A simple and effective solution is to dissociate a request queue from the
> block cgroup controller before blk_cleanup_queue() returns.  This is why
> commit a063057d7c73 ("block: Fix a race between request queue removal and
> the block cgroup controller") moved the blkcg_exit_queue() call from
> __blk_release_queue() into blk_cleanup_queue().

which is broken.  We can try to switch the lifetime model to revoking
all live objects but that likely is a lot more involved than blindly
moving blkg shootdown from release to cleanup.  Implementing sever
semantics is usually a lot more involved / fragile because it requires
explicit participation from all users (exactly the same way revoking
->queue_lock is difficult).

I'm not necessarily against switching to a sever model, but what the
patch did isn't that.  It just moved some code without actually
understanding or auditing what the implications are.

Thanks.
On Wed, 2018-04-11 at 10:15 -0700, tj@kernel.org wrote:
> On Wed, Apr 11, 2018 at 05:06:41PM +0000, Bart Van Assche wrote:
> > A simple and effective solution is to dissociate a request queue from the
> > block cgroup controller before blk_cleanup_queue() returns.  This is why
> > commit a063057d7c73 ("block: Fix a race between request queue removal and
> > the block cgroup controller") moved the blkcg_exit_queue() call from
> > __blk_release_queue() into blk_cleanup_queue().
>
> which is broken.  We can try to switch the lifetime model to revoking
> all live objects but that likely is a lot more involved than blindly
> moving blkg shootdown from release to cleanup.  Implementing sever
> semantics is usually a lot more involved / fragile because it requires
> explicit participation from all users (exactly the same way revoking
> ->queue_lock is difficult).
>
> I'm not necessarily against switching to a sever model, but what the
> patch did isn't that.  It just moved some code without actually
> understanding or auditing what the implications are.

Hello Tejun,

Please explain what you wrote further.  It's not clear to me why you think
that anything is broken, nor what a "sever model" is.  I think we really
need the blkcg_exit_queue() call in blk_cleanup_queue() to avoid race
conditions between request queue cleanup and the block cgroup controller.
Additionally, since it is guaranteed that no new requests will be submitted
to a queue after it has been marked dead, I don't see why it would make
sense to keep the association between a request queue and the blkcg
controller until the last reference on a queue is dropped.

Bart.
Hello,

On Wed, Apr 11, 2018 at 05:26:12PM +0000, Bart Van Assche wrote:
> Please explain what you wrote further.  It's not clear to me why you think
> that anything is broken, nor what a "sever model" is.

So, a sever (or revoke) model is one where you actively disconnect an
object and ensure that there'll be no further references from others.
In contrast, what we usually do is refcounting, where we don't actively
sever the dying object from its users but let the users drain themselves
over time and destroy the object once we know all the users are gone
(refcnt reaching zero).

> I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
> to avoid race conditions between request queue cleanup and the block cgroup
> controller.  Additionally, since it is guaranteed that no new requests will
> be submitted to a queue after it has been marked dead, I don't see why it
> would make sense to keep the association between a request queue and the
> blkcg controller until the last reference on a queue is dropped.

Sure, but new requests aren't the only source tho.  e.g. there can be
derefs through sysfs / cgroupfs or writeback attempts on dead devices.
If you want to implement sever, you gotta hunt down all those and make
sure they can't make any further derefs.

Thanks.
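The distinction, as a generic sketch rather than block-layer code; the
struct and function names here are made up for illustration:

	/* Refcount model: teardown is passive, users drain themselves. */
	void put_thing(struct thing *t)
	{
		if (refcount_dec_and_test(&t->ref))
			kfree(t);	/* freed only once every user is gone */
	}

	/* Sever model: teardown actively disconnects; every user must
	 * then check, under the relevant lock, that @t is still live
	 * before dereferencing further. */
	void sever_thing(struct thing *t)
	{
		spin_lock(&t->lock);
		t->dead = true;
		spin_unlock(&t->lock);
	}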
On Wed, Apr 11, 2018 at 03:54:53PM +0000, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> > [    0.766666] BUG: unable to handle kernel NULL pointer dereference at 00000000000001b4
> > [...]
> > [    0.766666] RIP: 0010:blk_queue_flag_set+0xf/0x40
> > [...]
> > [    0.766666] Call Trace:
> > [    0.766666]  blk_mq_quiesce_queue+0x23/0x80
> > [    0.766666]  nvme_dev_disable+0x34f/0x480
> > [...]
>
> Hello Alexandru,
>
> What made you look at cgroups?  In the above register dump I see that
> %rbx == NULL.  I think that means that the queue pointer argument of
> blk_queue_flag_set() is NULL.  The NVMe initiator driver should never pass
> a NULL pointer to blk_mq_quiesce_queue().  Please ask the NVMe driver
> maintainers for their opinion on the linux-nvme mailing list.
>
> Thanks,
>
> Bart.

The %rbx == NULL is only a symptom of the cgroup mishandling.  Perhaps we
could improve error handling in the NVMe driver, but I can say that about
a lot of block drivers actually; perhaps I will write some patches in the
future to improve the error handling.

But the root cause of it is in blkcg_init_queue(), where blkg_create()
returns an ERR ptr because it tries to insert at an already populated
index in blkcg->blkg_tree: the entry that we fail to remove at
__blk_release_queue().

Thanks,
../Alex
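For context, the insertion that fails lives in blkg_create(); roughly the
following, abridged from the blk-cgroup code of that era (shortened here,
so treat it as a sketch rather than the verbatim function):

	/* blkg_create(), abridged: q->id is the index into the per-blkcg
	 * radix tree.  A stale blkg left behind by a dead queue that owned
	 * the same id makes this insert fail with -EEXIST, which is then
	 * returned to blkcg_init_queue() as an ERR_PTR. */
	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
	if (likely(!ret)) {
		hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
		list_add(&blkg->q_node, &q->blkg_list);
	}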
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..224e937dbb59 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
  *
  * Destroy all blkgs associated with @q.
  */
-static void blkg_destroy_all(struct request_queue *q)
+void blkg_destroy_all(struct request_queue *q)
 {
 	struct blkcg_gq *blkg, *n;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0ec109..a72866458f22 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
 	if (q->bio_split)
 		bioset_free(q->bio_split);
 
+	spin_lock_irq(q->queue_lock);
+	blkg_destroy_all(q);
+	spin_unlock_irq(q->queue_lock);
+
 	ida_simple_remove(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6c666fd7de3c..3d60b1d1973d 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -179,6 +179,8 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 int blkcg_init_queue(struct request_queue *q);
 void blkcg_drain_queue(struct request_queue *q);
 void blkcg_exit_queue(struct request_queue *q);
+void blkg_destroy_all(struct request_queue *q);
+
 
 /* Blkio controller policy registration */
 int blkcg_policy_register(struct blkcg_policy *pol);
@@ -740,6 +742,7 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { ret
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
 static inline void blkcg_exit_queue(struct request_queue *q) { }
+static inline void blkg_destroy_all(struct request_queue *q) { }
 static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
 static inline void blkcg_policy_unregister(struct blkcg_policy *pol) { }
 static inline int blkcg_activate_policy(struct request_queue *q,
The q->id is used as an index within the blkg_tree radix tree.

If the entry is not released before the blk_queue_ida id is reclaimed,
blkcg_init_queue() in a different driver that is handed the same id
can fail, because the entry at that index in the radix tree already
exists.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
---
v2: Added no-op for !CONFIG_BLK_CGROUP

 block/blk-cgroup.c         | 2 +-
 block/blk-sysfs.c          | 4 ++++
 include/linux/blk-cgroup.h | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)
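The failure sequence the commit message describes, as a sketch; q1 and q2
stand for hypothetical queues from two different drivers, and the id value
is made up:

	/* driver A: q1 gets an id, and blkg_create() populates that index */
	q1->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);  /* say, 5 */
	/* ... blkcg_init_queue(q1) inserts a blkg at blkg_tree index 5 ... */

	/* q1 is released without destroying its blkgs, so index 5 stays
	 * populated while the id goes back into the pool */
	ida_simple_remove(&blk_queue_ida, q1->id);

	/* driver B: q2 is handed the reclaimed id */
	q2->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);  /* 5 again */

	/* blkcg_init_queue(q2) -> blkg_create() -> radix_tree_insert() now
	 * fails with -EEXIST, and queue initialization errors out */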