Message ID | 20200414041902.16769-4-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | blktrace: fix use after free |
On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> Ensure that the request_queue is refcounted during its full
> ioctl cycle. This avoids possible races against removal, given
> blk_get_queue() also checks to ensure the queue is not dying.
>
> This small race is possible if you defer removal of the request_queue
> and userspace fires off an ioctl for the device in the meantime.

Hmm, where exactly does the race come in so that it can only happen
after where you take the reference, but not before it? I'm probably
missing something, but that just means it needs to be explained a little
better :)
On Tue, Apr 14, 2020 at 08:40:44AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> > Ensure that the request_queue is refcounted during its full
> > ioctl cycle. This avoids possible races against removal, given
> > blk_get_queue() also checks to ensure the queue is not dying.
> >
> > This small race is possible if you defer removal of the request_queue
> > and userspace fires off an ioctl for the device in the meantime.
>
> Hmm, where exactly does the race come in so that it can only happen
> after where you take the reference, but not before it? I'm probably
> missing something, but that just means it needs to be explained a little
> better :)

From the trace on patch 2/5:

BLKTRACE_SETUP(loop0) #2
[   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
[   13.936758] === do_blk_trace_setup(2) start
[   13.938944] === do_blk_trace_setup(2) creating directory
[   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave

---> From LOOP_CTL_DEL(loop0) #2
[   13.971046] === blk_trace_cleanup(7) end
[   13.973175] == __blk_trace_remove(7) end
[   13.975352] == blk_trace_shutdown(7) end
[   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
[   13.980645] ==== blk_mq_debugfs_unregister(7) begin
[   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
[   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
[   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
[   13.993155] = __blk_release_queue(7) end

---> From BLKTRACE_SETUP(loop0) #2
[   13.995928] === do_blk_trace_setup(2) end with ret: 0
[   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end

The BLKTRACESETUP above works on a request_queue which LOOP_CTL_DEL
later races on and sweeps the debugfs dir out from underneath us.
This commit alone does not fix that race, though, because of both the
remaining debugfs_lookup() use and the fact that we're still using
asynchronous removal at this point.

Refcounting just ensures the request_queue is not freed from
underneath us.

Should I just add this to the commit log?

  Luis
On Wed, Apr 15, 2020 at 06:16:49AM +0000, Luis Chamberlain wrote:
> The BLKTRACESETUP above works on a request_queue which LOOP_CTL_DEL
> later races on and sweeps the debugfs dir out from underneath us.
> This commit alone does not fix that race, though, because of both the
> remaining debugfs_lookup() use and the fact that we're still using
> asynchronous removal at this point.
>
> Refcounting just ensures the request_queue is not freed from
> underneath us.
>
> Should I just add this to the commit log?

That sounds much more useful than the trace.

Btw, isn't blk_get_queue() racy as well? Shouldn't we check
blk_queue_dying() after getting the reference and undo it if the queue
is indeed dying?
On Wed, Apr 15, 2020 at 12:14:25AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 06:16:49AM +0000, Luis Chamberlain wrote:
> > The BLKTRACESETUP above works on a request_queue which LOOP_CTL_DEL
> > later races on and sweeps the debugfs dir out from underneath us.
> > This commit alone does not fix that race, though, because of both the
> > remaining debugfs_lookup() use and the fact that we're still using
> > asynchronous removal at this point.
> >
> > Refcounting just ensures the request_queue is not freed from
> > underneath us.
> >
> > Should I just add this to the commit log?
>
> That sounds much more useful than the trace.
>
> Btw, isn't blk_get_queue() racy as well? Shouldn't we check
> blk_queue_dying() after getting the reference and undo it if the queue
> is indeed dying?

Yes, that race should be possible:

bool blk_get_queue(struct request_queue *q)
{
	if (likely(!blk_queue_dying(q))) {
----------> we can get the queue to go dying here <---------
		__blk_get_queue(q);
		return true;
	}

	return false;
}
EXPORT_SYMBOL(blk_get_queue);

I'll pile up a fix. I've also considered doing a full review of callers
outside of the core block layer using it, and maybe just unexporting
this. It was originally exported by commit d86e0e83b ("block: export
blk_{get,put}_queue()") to fix a SCSI bug, but I can't find the
respective fix. I suspect that using bdgrab()/bdput() is more likely
what drivers should be using. That would allow us to keep this
functionality internal.

Think that's worth a review?

  Luis
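As an aside, a minimal sketch of the re-check Christoph suggests might
look as follows. This is a hypothetical illustration, not the fix that
was eventually posted, and as Bart notes below it still would not stop
the queue from transitioning to dying right after the second check:

bool blk_get_queue(struct request_queue *q)
{
	if (blk_queue_dying(q))
		return false;

	__blk_get_queue(q);

	/*
	 * Re-check after taking the reference: the queue may have
	 * transitioned to dying in the window flagged above. Undo the
	 * reference and fail in that case.
	 */
	if (unlikely(blk_queue_dying(q))) {
		blk_put_queue(q);
		return false;
	}

	return true;
}

The re-check only narrows the window; the reference itself stays valid
either way, since it pins the kobject against freeing, not against the
dying transition.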
On Wed, Apr 15, 2020 at 12:34:34PM +0000, Luis Chamberlain wrote:
> I'll pile up a fix. I've also considered doing a full review of callers
> outside of the core block layer using it, and maybe just unexporting
> this. It was originally exported by commit d86e0e83b ("block: export
> blk_{get,put}_queue()") to fix a SCSI bug, but I can't find the
> respective fix. I suspect that using bdgrab()/bdput() is more likely
> what drivers should be using. That would allow us to keep this
> functionality internal.
>
> Think that's worth a review?

Probably. I did in fact very quickly look into that, but then gave
up due to the fair amount of modular users.
On Wed, Apr 15, 2020 at 05:39:25AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 12:34:34PM +0000, Luis Chamberlain wrote:
> > I'll pile up a fix. I've also considered doing a full review of callers
> > outside of the core block layer using it, and maybe just unexporting
> > this. It was originally exported by commit d86e0e83b ("block: export
> > blk_{get,put}_queue()") to fix a SCSI bug, but I can't find the
> > respective fix. I suspect that using bdgrab()/bdput() is more likely
> > what drivers should be using. That would allow us to keep this
> > functionality internal.
> >
> > Think that's worth a review?
>
> Probably. I did in fact very quickly look into that, but then gave
> up due to the fair amount of modular users.

Alright, then we might as well verify whether the existing practice of
bdgrab()/bdput() is indeed valid logic, as otherwise we'd be putting
the atomic context / sleep concern onto bdput(). As noted earlier, I
can easily confirm that bdgrab() can be called in atomic context;
however, I cannot yet easily vet *why* this was a safe assumption for
bdput().

  Luis
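For context, the asymmetry being discussed follows from how the two
helpers were defined in fs/block_dev.c of that era; quoted roughly from
memory, so treat the exact bodies as an assumption:

struct block_device *bdgrab(struct block_device *bdev)
{
	ihold(bdev->bd_inode);	/* plain atomic increment, safe in atomic context */
	return bdev;
}

void bdput(struct block_device *bdev)
{
	iput(bdev->bd_inode);	/* may drop the last inode reference and sleep */
}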
On 2020-04-15 05:34, Luis Chamberlain wrote:
> On Wed, Apr 15, 2020 at 12:14:25AM -0700, Christoph Hellwig wrote:
>> Btw, isn't blk_get_queue() racy as well? Shouldn't we check
>> blk_queue_dying() after getting the reference and undo it if the queue
>> is indeed dying?
>
> Yes, that race should be possible:
>
> bool blk_get_queue(struct request_queue *q)
> {
> 	if (likely(!blk_queue_dying(q))) {
> ----------> we can get the queue to go dying here <---------
> 		__blk_get_queue(q);
> 		return true;
> 	}
>
> 	return false;
> }
> EXPORT_SYMBOL(blk_get_queue);
>
> I'll pile up a fix. I've also considered doing a full review of callers
> outside of the core block layer using it, and maybe just unexporting
> this. It was originally exported by commit d86e0e83b ("block: export
> blk_{get,put}_queue()") to fix a SCSI bug, but I can't find the
> respective fix. I suspect that using bdgrab()/bdput() is more likely
> what drivers should be using. That would allow us to keep this
> functionality internal.

blk_get_queue() prevents concurrent freeing of struct request_queue but
does not prevent concurrent blk_cleanup_queue() calls. Callers of
blk_get_queue() may encounter a change of the queue state from normal
into dying at any time during the blk_get_queue() call or after
blk_get_queue() has finished. Maybe I'm overlooking something, but I
doubt that modifying blk_get_queue() will help.

Bart.
On 2020-04-14 23:16, Luis Chamberlain wrote:
> On Tue, Apr 14, 2020 at 08:40:44AM -0700, Christoph Hellwig wrote:
>> Hmm, where exactly does the race come in so that it can only happen
>> after where you take the reference, but not before it? I'm probably
>> missing something, but that just means it needs to be explained a little
>> better :)
>
> From the trace on patch 2/5:
>
> BLKTRACE_SETUP(loop0) #2
> [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
> [   13.936758] === do_blk_trace_setup(2) start
> [   13.938944] === do_blk_trace_setup(2) creating directory
> [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
>
> ---> From LOOP_CTL_DEL(loop0) #2
> [   13.971046] === blk_trace_cleanup(7) end
> [   13.973175] == __blk_trace_remove(7) end
> [   13.975352] == blk_trace_shutdown(7) end
> [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
> [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
> [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
> [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
> [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
> [   13.993155] = __blk_release_queue(7) end
>
> ---> From BLKTRACE_SETUP(loop0) #2
> [   13.995928] === do_blk_trace_setup(2) end with ret: 0
> [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
>
> The BLKTRACESETUP above works on a request_queue which LOOP_CTL_DEL
> later races on and sweeps the debugfs dir out from underneath us.
> This commit alone does not fix that race, though, because of both the
> remaining debugfs_lookup() use and the fact that we're still using
> asynchronous removal at this point.
>
> Refcounting just ensures the request_queue is not freed from
> underneath us.

I think the above trace reveals a bug in the loop driver. The loop
driver shouldn't allow the associated request queue to disappear while
the loop device is open. One may want to have a look at sd_open() in
the sd driver. The scsi_disk_get() call in that function not only
increases the reference count of the SCSI disk but also of the
underlying SCSI device.

Thanks,

Bart.
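For context, the sd pattern Bart points at is built on scsi_disk_get();
a simplified paraphrase of the era's drivers/scsi/sd.c, not the
verbatim code:

static struct scsi_disk *scsi_disk_get(struct gendisk *disk)
{
	struct scsi_disk *sdkp;

	mutex_lock(&sd_ref_mutex);
	sdkp = scsi_disk(disk);
	if (sdkp) {
		/*
		 * Take a reference on the underlying SCSI device
		 * first, then pin the SCSI disk itself. Both are held
		 * for as long as the disk stays open.
		 */
		if (scsi_device_get(sdkp->device) == 0)
			get_device(&sdkp->dev);
		else
			sdkp = NULL;
	}
	mutex_unlock(&sd_ref_mutex);
	return sdkp;
}

Because the scsi_device reference is held across the open, the request
queue hanging off it cannot be torn down while the device node is in
use.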
On Wed, Apr 15, 2020 at 07:18:22AM -0700, Bart Van Assche wrote:
> On 2020-04-15 05:34, Luis Chamberlain wrote:
> > On Wed, Apr 15, 2020 at 12:14:25AM -0700, Christoph Hellwig wrote:
> >> Btw, isn't blk_get_queue() racy as well? Shouldn't we check
> >> blk_queue_dying() after getting the reference and undo it if the queue
> >> is indeed dying?
> >
> > Yes, that race should be possible:
> >
> > bool blk_get_queue(struct request_queue *q)
> > {
> > 	if (likely(!blk_queue_dying(q))) {
> > ----------> we can get the queue to go dying here <---------
> > 		__blk_get_queue(q);
> > 		return true;
> > 	}
> >
> > 	return false;
> > }
> > EXPORT_SYMBOL(blk_get_queue);
> >
> > I'll pile up a fix. I've also considered doing a full review of callers
> > outside of the core block layer using it, and maybe just unexporting
> > this. It was originally exported by commit d86e0e83b ("block: export
> > blk_{get,put}_queue()") to fix a SCSI bug, but I can't find the
> > respective fix. I suspect that using bdgrab()/bdput() is more likely
> > what drivers should be using. That would allow us to keep this
> > functionality internal.
>
> blk_get_queue() prevents concurrent freeing of struct request_queue but
> does not prevent concurrent blk_cleanup_queue() calls.

Wouldn't concurrent blk_cleanup_queue() calls be a bug? If so, should
I make it clear that it would be, or simply prevent it?

> Callers of
> blk_get_queue() may encounter a change of the queue state from normal
> into dying at any time during the blk_get_queue() call or after
> blk_get_queue() has finished. Maybe I'm overlooking something, but I
> doubt that modifying blk_get_queue() will help.

Good point. To fix the race described by Christoph, we'd have to take
the request_queue refcount into consideration and prevent queues from
changing state to dying while the refcount is > 1; however, that would
also mean never allowing the request_queue to die. One way we could
resolve this could be to keep track of a quiesce/dying request, at that
point prevent blk_get_queue() from allowing increments, and once the
refcount is down to 1, flip the switch to dying.

  Luis
On Wed, Apr 15, 2020 at 07:45:18AM -0700, Bart Van Assche wrote:
> On 2020-04-14 23:16, Luis Chamberlain wrote:
> > On Tue, Apr 14, 2020 at 08:40:44AM -0700, Christoph Hellwig wrote:
> >> Hmm, where exactly does the race come in so that it can only happen
> >> after where you take the reference, but not before it? I'm probably
> >> missing something, but that just means it needs to be explained a little
> >> better :)
> >
> > From the trace on patch 2/5:
> >
> > BLKTRACE_SETUP(loop0) #2
> > [   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
> > [   13.936758] === do_blk_trace_setup(2) start
> > [   13.938944] === do_blk_trace_setup(2) creating directory
> > [   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave
> >
> > ---> From LOOP_CTL_DEL(loop0) #2
> > [   13.971046] === blk_trace_cleanup(7) end
> > [   13.973175] == __blk_trace_remove(7) end
> > [   13.975352] == blk_trace_shutdown(7) end
> > [   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
> > [   13.980645] ==== blk_mq_debugfs_unregister(7) begin
> > [   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
> > [   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
> > [   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
> > [   13.993155] = __blk_release_queue(7) end
> >
> > ---> From BLKTRACE_SETUP(loop0) #2
> > [   13.995928] === do_blk_trace_setup(2) end with ret: 0
> > [   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end
> >
> > The BLKTRACESETUP above works on a request_queue which LOOP_CTL_DEL
> > later races on and sweeps the debugfs dir out from underneath us.
> > This commit alone does not fix that race, though, because of both the
> > remaining debugfs_lookup() use and the fact that we're still using
> > asynchronous removal at this point.
> >
> > Refcounting just ensures the request_queue is not freed from
> > underneath us.
>
> I think the above trace reveals a bug in the loop driver. The loop
> driver shouldn't allow the associated request queue to disappear while
> the loop device is open.

The bug was *not* in the driver; the bug was that the deferral of
removal was allowed to be asynchronous, so from a userspace perspective
the removal *finishes*, but it is not actually done. Back when removal
was synchronous, the loop driver waited on cleanup and didn't return to
userspace until the device was really removed. This is why I annotated
that the move to async removal actually turns out to be a userspace API
regression.

> One may want to have a look at sd_open() in
> the sd driver. The scsi_disk_get() call in that function not only
> increases the reference count of the SCSI disk but also of the
> underlying SCSI device.

Are you saying to use this as a template for what a driver should do,
or do you suspect there is a bug there? Not sure what you mean here.

  Luis
On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> Ensure that the request_queue is refcounted during its full
> ioctl cycle. This avoids possible races against removal, given
> blk_get_queue() also checks to ensure the queue is not dying.
>
> This small race is possible if you defer removal of the request_queue
> and userspace fires off an ioctl for the device in the meantime.
>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  kernel/trace/blktrace.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 15086227592f..17e144d15779 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
>  	if (!q)
>  		return -ENXIO;
>
> +	if (!blk_get_queue(q))
> +		return -ENXIO;
> +
>  	mutex_lock(&q->blk_trace_mutex);
>
>  	switch (cmd) {
> @@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
>  	}
>
>  	mutex_unlock(&q->blk_trace_mutex);
> +
> +	blk_put_queue(q);
> +
>  	return ret;
>  }

Actually when the bdev is opened, one extra refcount is held on the
gendisk, so the gendisk won't go away. And __device_add_disk() grabs
one extra refcount on the request queue, so the request queue shouldn't
go away while the ioctl is running.

Can you describe a bit more what issue this patch addresses?

Thanks,
Ming
On 2020-04-15 18:12, Luis Chamberlain wrote:
> On Wed, Apr 15, 2020 at 07:18:22AM -0700, Bart Van Assche wrote:
>> blk_get_queue() prevents concurrent freeing of struct request_queue but
>> does not prevent concurrent blk_cleanup_queue() calls.
>
> Wouldn't concurrent blk_cleanup_queue() calls be a bug? If so, should
> I make it clear that it would be, or simply prevent it?

I think calling blk_cleanup_queue() while the queue refcount is > 0 is
well-established behavior. At least the SCSI core has triggered that
behavior for a very long time. I prefer not to change that behavior.

Regarding patch 3/5: how about dropping that patch? If the queue
refcount can drop to zero while blk_trace_ioctl() is in progress, I
think that should be fixed in the block_device_operations.open callback
instead of in blk_trace_ioctl().

Thanks,

Bart.
On Wed, Apr 15, 2020 at 08:43:32PM -0700, Bart Van Assche wrote:
> On 2020-04-15 18:12, Luis Chamberlain wrote:
> > On Wed, Apr 15, 2020 at 07:18:22AM -0700, Bart Van Assche wrote:
> >> blk_get_queue() prevents concurrent freeing of struct request_queue but
> >> does not prevent concurrent blk_cleanup_queue() calls.
> >
> > Wouldn't concurrent blk_cleanup_queue() calls be a bug? If so, should
> > I make it clear that it would be, or simply prevent it?
>
> I think calling blk_cleanup_queue() while the queue refcount is > 0 is
> well-established behavior. At least the SCSI core has triggered that
> behavior for a very long time. I prefer not to change that behavior.

I see. An alternative is to simply check whether we are already
cleaning up and, if so, abort early in blk_cleanup_queue(). That would
allow re-entrant calls and just make the additional calls a no-op. Or
is being re-entrant, with two attempts to really do all the work,
already expected functionality of blk_cleanup_queue()?

> Regarding patch 3/5: how about dropping that patch? If the queue
> refcount can drop to zero while blk_trace_ioctl() is in progress, I
> think that should be fixed in the block_device_operations.open callback
> instead of in blk_trace_ioctl().

I'll take a look, thanks!

  Luis
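As an illustration of the early-abort idea floated above, a sketch
could look like the following; QUEUE_FLAG_CLEANUP is a hypothetical
flag invented for this sketch, not an existing queue flag, and this is
not how blk_cleanup_queue() actually guards itself:

void blk_cleanup_queue(struct request_queue *q)
{
	/*
	 * Make concurrent calls a no-op: only the first caller that
	 * atomically sets the (hypothetical) flag proceeds with the
	 * actual teardown; later callers return early.
	 */
	if (blk_queue_flag_test_and_set(QUEUE_FLAG_CLEANUP, q))
		return;

	/* ... existing teardown work goes here ... */
}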
On Thu, Apr 16, 2020 at 10:31:22AM +0800, Ming Lei wrote:
> On Tue, Apr 14, 2020 at 04:19:00AM +0000, Luis Chamberlain wrote:
> > Ensure that the request_queue is refcounted during its full
> > ioctl cycle. This avoids possible races against removal, given
> > blk_get_queue() also checks to ensure the queue is not dying.
> >
> > This small race is possible if you defer removal of the request_queue
> > and userspace fires off an ioctl for the device in the meantime.
> >
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  kernel/trace/blktrace.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index 15086227592f..17e144d15779 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
> >  	if (!q)
> >  		return -ENXIO;
> >
> > +	if (!blk_get_queue(q))
> > +		return -ENXIO;
> > +
> >  	mutex_lock(&q->blk_trace_mutex);
> >
> >  	switch (cmd) {
> > @@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
> >  	}
> >
> >  	mutex_unlock(&q->blk_trace_mutex);
> > +
> > +	blk_put_queue(q);
> > +
> >  	return ret;
> >  }
>
> Actually when the bdev is opened, one extra refcount is held on the
> gendisk, so the gendisk won't go away. And __device_add_disk() grabs
> one extra refcount on the request queue, so the request queue shouldn't
> go away while the ioctl is running.

Alright, then yes, this should not be needed.

  Luis
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 15086227592f..17e144d15779 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	if (!q)
 		return -ENXIO;
 
+	if (!blk_get_queue(q))
+		return -ENXIO;
+
 	mutex_lock(&q->blk_trace_mutex);
 
 	switch (cmd) {
@@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	}
 
 	mutex_unlock(&q->blk_trace_mutex);
+
+	blk_put_queue(q);
+
 	return ret;
 }