diff mbox series

[v3] block: genhd: don't call probe function with major_names_lock held

Message ID 4e153910-bf60-2cca-fa02-b46d22b6e2c5@i-love.sakura.ne.jp (mailing list archive)
State New, archived
Headers show
Series [v3] block: genhd: don't call probe function with major_names_lock held | expand

Commit Message

Tetsuo Handa Aug. 15, 2021, 6:52 a.m. UTC
syzbot is reporting circular locking problem at __loop_clr_fd() [1], for
commit a160c6159d4a0cf8 ("block: add an optional probe callback to
major_names") is calling the module's probe function with major_names_lock
held.

When copying content of /proc/devices to another file via sendfile(),

  sb_writers#$N => &p->lock => major_names_lock

dependency is recorded.

When loop_process_work() from WQ context performs a write request,

  (wq_completion)loop$M => (work_completion)&lo->rootcg_work =>
  sb_writers#$N

dependency is recorded.

When flush_workqueue() from drain_workqueue() from destroy_workqueue()
 from __loop_clr_fd() from blkdev_put() from blkdev_close() from __fput()
is called,

  &disk->open_mutex => &lo->lo_mutex => (wq_completion)loop$M

dependency is recorded.

When loop_control_remove() from loop_control_ioctl(LOOP_CTL_REMOVE) is
called,

  loop_ctl_mutex => &lo->lo_mutex

dependency is recorded.

As a result, lockdep thinks that there is

  loop_ctl_mutex => &lo->lo_mutex => (wq_completion)loop$M =>
  (work_completion)&lo->rootcg_work => sb_writers#$N => &p->lock =>
  major_names_lock

dependency chain.

Then, if loop_add() from loop_probe() from blk_request_module() from
blkdev_get_no_open() from blkdev_get_by_dev() from blkdev_open() from
do_dentry_open() from path_openat() from do_filp_open() is called,

  major_names_lock => loop_ctl_mutex

dependency is appended to the dependency chain.

There would be two approaches for breaking this circular dependency.
One is to kill loop_ctl_mutex => &lo->lo_mutex chain. The other is to
kill major_names_lock => loop_ctl_mutex chain. This patch implements
the latter, due to the following reasons.

 (1) sb_writers#$N => &p->lock => major_names_lock chain is unavoidable

 (2) this patch can also fix similar problem in other modules [2] which
     is caused by calling the probe function with major_names_lock held

 (3) I believe that this patch is principally safer than e.g.
     commit bd5c39edad535d9f ("loop: reduce loop_ctl_mutex coverage in
     loop_exit") which waits until the probe function finishes using
     global mutex in order to fix deadlock reproducible by sleep
     injection [3]

This patch adds THIS_MODULE parameter to __register_blkdev(), based on an
assumption that a probe callback passed to __register_blkdev() belongs to
a module which calls __register_blkdev(). Then, drop major_names_lock
before calling probe function by holding a reference to that module which
contains that probe function.

It may sound strange to pass THIS_MODULE as a function argument, but
what this patch is doing is essentially the same with passing e.g.
"struct file_system_type" argument initialized with .owner = THIS_MODULE
to register_filesystem(). To minimize lines changed, this patch does not
define some "struct" for __register_blkdev().

Link: https://syzkaller.appspot.com/bug?id=7bb10e8b62f83e4d445cdf4c13d69e407e629558 [1]
Link: https://syzkaller.appspot.com/bug?id=7bd106c28e846d1023d4ca915718b1a0905444cb [2]
Link: https://lkml.kernel.org/r/c4edf07f-92e1-a350-2743-f0b0234a2b6c@i-love.sakura.ne.jp [3]
Reported-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+6a8a0d93c91e8fbf2e80@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com>
Fixes: a160c6159d4a0cf8 ("block: add an optional probe callback to major_names")
---
 block/genhd.c           | 32 +++++++++++++++++++++++++-------
 drivers/block/ataflop.c |  2 +-
 drivers/block/brd.c     |  2 +-
 drivers/block/floppy.c  |  2 +-
 drivers/block/loop.c    |  2 +-
 drivers/md/md.c         |  4 ++--
 drivers/scsi/sd.c       |  2 +-
 include/linux/genhd.h   |  4 ++--
 8 files changed, 34 insertions(+), 16 deletions(-)

Comments

Greg KH Aug. 15, 2021, 7:06 a.m. UTC | #1
On Sun, Aug 15, 2021 at 03:52:45PM +0900, Tetsuo Handa wrote:
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -303,9 +303,9 @@ struct gendisk *__blk_alloc_disk(int node);
>  void blk_cleanup_disk(struct gendisk *disk);
>  
>  int __register_blkdev(unsigned int major, const char *name,
> -		void (*probe)(dev_t devt));
> +		      void (*probe)(dev_t devt), struct module *owner);
>  #define register_blkdev(major, name) \
> -	__register_blkdev(major, name, NULL)
> +	__register_blkdev(major, name, NULL, NULL)
>  void unregister_blkdev(unsigned int major, const char *name);

Do not force modules to put their own THIS_MODULE macro as a parameter,
put it in the .h file so that it happens automagically, much like the
usb_register() define in include/linux/usb.h is created.

If you do that, you can probably get rid of the __register_blkdev()
direct calls as well...

thanks,

greg k-h
Tetsuo Handa Aug. 15, 2021, 7:49 a.m. UTC | #2
On 2021/08/15 16:06, Greg KH wrote:
> On Sun, Aug 15, 2021 at 03:52:45PM +0900, Tetsuo Handa wrote:
>> --- a/include/linux/genhd.h
>> +++ b/include/linux/genhd.h
>> @@ -303,9 +303,9 @@ struct gendisk *__blk_alloc_disk(int node);
>>  void blk_cleanup_disk(struct gendisk *disk);
>>  
>>  int __register_blkdev(unsigned int major, const char *name,
>> -		void (*probe)(dev_t devt));
>> +		      void (*probe)(dev_t devt), struct module *owner);
>>  #define register_blkdev(major, name) \
>> -	__register_blkdev(major, name, NULL)
>> +	__register_blkdev(major, name, NULL, NULL)
>>  void unregister_blkdev(unsigned int major, const char *name);
> 
> Do not force modules to put their own THIS_MODULE macro as a parameter,
> put it in the .h file so that it happens automagically, much like the
> usb_register() define in include/linux/usb.h is created.

Sure. We can do like below.

diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 13b34177cc85..70f00641fa11 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -302,10 +302,12 @@ extern void put_disk(struct gendisk *disk);
 struct gendisk *__blk_alloc_disk(int node);
 void blk_cleanup_disk(struct gendisk *disk);
 
-int __register_blkdev(unsigned int major, const char *name,
-		void (*probe)(dev_t devt));
+int ____register_blkdev(unsigned int major, const char *name,
+			void (*probe)(dev_t devt), struct module *owner);
+#define __register_blkdev(major, name, probe) \
+	____register_blkdev(major, name, probe, THIS_MODULE)
 #define register_blkdev(major, name) \
-	__register_blkdev(major, name, NULL)
+	____register_blkdev(major, name, NULL, NULL)
 void unregister_blkdev(unsigned int major, const char *name);
 
 bool bdev_check_media_change(struct block_device *bdev);

> 
> If you do that, you can probably get rid of the __register_blkdev()
> direct calls as well...

I assume "automagically" implies "do not patch individual unregister_blkdev() /
__register_blkdev() callers". But "removing __register_blkdev() direct calls"
requires "patching individual __register_blkdev() callers". I didn't catch
what you suggested here.

Anyway, since this patch should be backported to 5.11+ kernels, lines changed
should be kept minimal. We can do whatever remapping after this patch.
Greg KH Aug. 15, 2021, 9:19 a.m. UTC | #3
On Sun, Aug 15, 2021 at 04:49:55PM +0900, Tetsuo Handa wrote:
> On 2021/08/15 16:06, Greg KH wrote:
> > On Sun, Aug 15, 2021 at 03:52:45PM +0900, Tetsuo Handa wrote:
> >> --- a/include/linux/genhd.h
> >> +++ b/include/linux/genhd.h
> >> @@ -303,9 +303,9 @@ struct gendisk *__blk_alloc_disk(int node);
> >>  void blk_cleanup_disk(struct gendisk *disk);
> >>  
> >>  int __register_blkdev(unsigned int major, const char *name,
> >> -		void (*probe)(dev_t devt));
> >> +		      void (*probe)(dev_t devt), struct module *owner);
> >>  #define register_blkdev(major, name) \
> >> -	__register_blkdev(major, name, NULL)
> >> +	__register_blkdev(major, name, NULL, NULL)
> >>  void unregister_blkdev(unsigned int major, const char *name);
> > 
> > Do not force modules to put their own THIS_MODULE macro as a parameter,
> > put it in the .h file so that it happens automagically, much like the
> > usb_register() define in include/linux/usb.h is created.
> 
> Sure. We can do like below.
> 
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 13b34177cc85..70f00641fa11 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -302,10 +302,12 @@ extern void put_disk(struct gendisk *disk);
>  struct gendisk *__blk_alloc_disk(int node);
>  void blk_cleanup_disk(struct gendisk *disk);
>  
> -int __register_blkdev(unsigned int major, const char *name,
> -		void (*probe)(dev_t devt));
> +int ____register_blkdev(unsigned int major, const char *name,
> +			void (*probe)(dev_t devt), struct module *owner);
> +#define __register_blkdev(major, name, probe) \
> +	____register_blkdev(major, name, probe, THIS_MODULE)
>  #define register_blkdev(major, name) \
> -	__register_blkdev(major, name, NULL)
> +	____register_blkdev(major, name, NULL, NULL)

Why stop at 4 _ characters?

{sigh}

I think you need a new naming scheme here...

> > 
> > If you do that, you can probably get rid of the __register_blkdev()
> > direct calls as well...
> 
> I assume "automagically" implies "do not patch individual unregister_blkdev() /
> __register_blkdev() callers". But "removing __register_blkdev() direct calls"
> requires "patching individual __register_blkdev() callers". I didn't catch
> what you suggested here.

Yes, that is a different thing, sorry for the confusion.

> Anyway, since this patch should be backported to 5.11+ kernels, lines changed
> should be kept minimal. We can do whatever remapping after this patch.

Don't worry about backports, fix it properly first.

thanks,

greg k-h
Christoph Hellwig Aug. 16, 2021, 7:33 a.m. UTC | #4
This is the wrong way to approach it.  Instead of making things ever
more complex try to make it simpler.  In this case, ensure that the
destroy_workqueue is not held with pointless locks.  Given that the
loop device already is known to not have a reference and marked as in
the rundown state there shouldn't be anything that is required to
be protected by lo_mutex.  So something like this untested patch
should probably do the work:

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index fa1c298a8cfb..c734dc768316 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1347,16 +1347,15 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	 * became visible.
 	 */
 
-	mutex_lock(&lo->lo_mutex);
 	if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
 		err = -ENXIO;
-		goto out_unlock;
+		goto out;
 	}
 
 	filp = lo->lo_backing_file;
 	if (filp == NULL) {
 		err = -EINVAL;
-		goto out_unlock;
+		goto out;
 	}
 
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
@@ -1366,6 +1365,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	blk_mq_freeze_queue(lo->lo_queue);
 
 	destroy_workqueue(lo->workqueue);
+
+	mutex_lock(&lo->lo_mutex);
 	spin_lock_irq(&lo->lo_work_lock);
 	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
 				idle_list) {
@@ -1413,8 +1414,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
 	lo_number = lo->lo_number;
 	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
-out_unlock:
 	mutex_unlock(&lo->lo_mutex);
+
 	if (partscan) {
 		/*
 		 * open_mutex has been held already in release path, so don't
@@ -1435,7 +1436,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 		/* Device is gone, no point in returning error */
 		err = 0;
 	}
-
+out:
 	/*
 	 * lo->lo_state is set to Lo_unbound here after above partscan has
 	 * finished.
Tetsuo Handa Aug. 16, 2021, 2:44 p.m. UTC | #5
On 2021/08/16 16:33, Christoph Hellwig wrote:
> This is the wrong way to approach it.  Instead of making things ever
> more complex try to make it simpler.  In this case, ensure that the
> destroy_workqueue is not held with pointless locks.  Given that the
> loop device already is known to not have a reference and marked as in
> the rundown state there shouldn't be anything that is required to
> be protected by lo_mutex.  So something like this untested patch
> should probably do the work:

I tested your untested patch, and I confirmed that your untested patch
does not fix the circular locking problem, for you are not seeing the
entire dependency chain.

> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index fa1c298a8cfb..c734dc768316 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1347,16 +1347,15 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  	 * became visible.
>  	 */
>  
> -	mutex_lock(&lo->lo_mutex);
>  	if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
>  		err = -ENXIO;
> -		goto out_unlock;
> +		goto out;
>  	}

Would you please write your patch as verbose as mine? Since your patch
lacks why it is safe to make such change, I can't review your patch.
I wonder why "goto out;" (which sets lo->lo_state = Lo_unbound) is
the correct choice here.

>  
>  	filp = lo->lo_backing_file;
>  	if (filp == NULL) {
>  		err = -EINVAL;
> -		goto out_unlock;
> +		goto out;
>  	}

I also wonder when lo->lo_backing_file == NULL case is possible.

As far as I can see, "lo->lo_state = Lo_rundown;" is done only when
"lo->lo_state == Lo_bound". And "lo->lo_state = Lo_bound;" is done
only when "lo->lo_backing_file = file;" was done.

I guess this is another sanity check which should use WARN_ON_ONCE().

>  
>  	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
> @@ -1366,6 +1365,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  	blk_mq_freeze_queue(lo->lo_queue);

I also wonder why it is safe to call blk_mq_freeze_queue() without holding
lo->lo_mutex. loop_change_fd(), loop_set_status(), lo_release() etc. are
calling blk_mq_freeze_queue() with lo->lo_mutex held.

If we need to hold lo->lo_mutex when calling blk_mq_freeze_queue() from
__loop_clr_fd(), why it is safe to once release lo->lo_mutex before
destroy_workqueue() and reacquire lo->lo_mutex after destroy_workqueue() ?

Since there is too little comment regarding what lock protects what resource
and/or operation, nobody can review the correctness of locking in loop module.
The loop module is a labyrinth...

>  
>  	destroy_workqueue(lo->workqueue);
> +
> +	mutex_lock(&lo->lo_mutex);
>  	spin_lock_irq(&lo->lo_work_lock);
>  	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
>  				idle_list) {
> @@ -1413,8 +1414,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
>  	lo_number = lo->lo_number;
>  	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
> -out_unlock:
>  	mutex_unlock(&lo->lo_mutex);
> +
>  	if (partscan) {
>  		/*
>  		 * open_mutex has been held already in release path, so don't
> @@ -1435,7 +1436,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  		/* Device is gone, no point in returning error */
>  		err = 0;
>  	}
> -
> +out:
>  	/*
>  	 * lo->lo_state is set to Lo_unbound here after above partscan has
>  	 * finished.
> 

Anyway, I modified your patch as below and tested on v5.14-rc6.

----------------------------------------
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f0cdff0c5fbf..daa47cea8a32 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1349,17 +1349,12 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	 * became visible.
 	 */
 
-	mutex_lock(&lo->lo_mutex);
-	if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
-		err = -ENXIO;
-		goto out_unlock;
-	}
+	if (WARN_ON_ONCE(lo->lo_state != Lo_rundown))
+		return -ENXIO;
 
 	filp = lo->lo_backing_file;
-	if (filp == NULL) {
-		err = -EINVAL;
-		goto out_unlock;
-	}
+	if (WARN_ON_ONCE(!filp))
+		return -EINVAL;
 
 	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
 		blk_queue_write_cache(lo->lo_queue, false, false);
@@ -1368,6 +1363,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	blk_mq_freeze_queue(lo->lo_queue);
 
 	destroy_workqueue(lo->workqueue);
+
+	mutex_lock(&lo->lo_mutex);
 	spin_lock_irq(&lo->lo_work_lock);
 	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
 				idle_list) {
@@ -1415,8 +1412,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
 	lo_number = lo->lo_number;
-out_unlock:
 	mutex_unlock(&lo->lo_mutex);
+
 	if (partscan) {
 		/*
 		 * open_mutex has been held already in release path, so don't
diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
index 6ce4bc57f919..d02412055e6c 100644
--- a/drivers/mtd/mtd_blkdevs.c
+++ b/drivers/mtd/mtd_blkdevs.c
@@ -525,14 +525,10 @@ int register_mtd_blktrans(struct mtd_blktrans_ops *tr)
 	if (!blktrans_notifier.list.next)
 		register_mtd_user(&blktrans_notifier);
 
-
-	mutex_lock(&mtd_table_mutex);
-
 	ret = register_blkdev(tr->major, tr->name);
 	if (ret < 0) {
 		printk(KERN_WARNING "Unable to register %s block device on major %d: %d\n",
 		       tr->name, tr->major, ret);
-		mutex_unlock(&mtd_table_mutex);
 		return ret;
 	}
 
@@ -542,12 +538,12 @@ int register_mtd_blktrans(struct mtd_blktrans_ops *tr)
 	tr->blkshift = ffs(tr->blksize) - 1;
 
 	INIT_LIST_HEAD(&tr->devs);
-	list_add(&tr->list, &blktrans_majors);
 
+	mutex_lock(&mtd_table_mutex);
+	list_add(&tr->list, &blktrans_majors);
 	mtd_for_each_device(mtd)
 		if (mtd->type != MTD_ABSENT)
 			tr->add_mtd(tr, mtd);
-
 	mutex_unlock(&mtd_table_mutex);
 	return 0;
 }
----------------------------------------

And I immediately got the following lockdep warning (as I expected).

----------------------------------------
[  124.519262] loop0: detected capacity change from 0 to 4096
[  124.552028] EXT4-fs error (device loop0): ext4_fill_super:4956: inode #2: comm a.out: iget: root inode unallocated
[  124.556889] EXT4-fs (loop0): get root inode failed
[  124.559074] EXT4-fs (loop0): mount failed

[  124.917463] ======================================================
[  124.919741] WARNING: possible circular locking dependency detected
[  124.922233] 5.14.0-rc6+ #744 Not tainted
[  124.923799] ------------------------------------------------------
[  124.925996] systemd-udevd/6710 is trying to acquire lock:
[  124.928106] ffff88800dc82948 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x85/0x560
[  124.931100]
               but task is already holding lock:
[  124.933388] ffff88800dc87128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x180
[  124.936216]
               which lock already depends on the new lock.

[  124.939807]
               the existing dependency chain (in reverse order) is:
[  124.942741]
               -> #6 (&disk->open_mutex){+.+.}-{3:3}:
[  124.945672]        lock_acquire+0xd0/0x300
[  124.947321]        __mutex_lock+0x97/0x950
[  124.948938]        del_gendisk+0x4e/0x1d0
[  124.950666]        loop_remove+0x10/0x40
[  124.952238]        loop_control_ioctl+0x193/0x1a0
[  124.954068]        __x64_sys_ioctl+0x6a/0xa0
[  124.955736]        do_syscall_64+0x35/0xb0
[  124.957326]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  124.959358]
               -> #5 (loop_ctl_mutex){+.+.}-{3:3}:
[  124.962171]        lock_acquire+0xd0/0x300
[  124.963810]        __mutex_lock+0x97/0x950
[  124.965426]        loop_add+0x44/0x2c0
[  124.966890]        blk_request_module+0x63/0xc0
[  124.968526]        blkdev_get_no_open+0x98/0xc0
[  124.970361]        blkdev_get_by_dev+0x54/0x330
[  124.972108]        blkdev_open+0x59/0xa0
[  124.973681]        do_dentry_open+0x14c/0x3a0
[  124.975321]        path_openat+0x78e/0xa50
[  124.977043]        do_filp_open+0xad/0x120
[  124.979067]        do_sys_openat2+0x241/0x310
[  124.980969]        do_sys_open+0x3f/0x80
[  124.982581]        do_syscall_64+0x35/0xb0
[  124.984201]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  124.986265]
               -> #4 (major_names_lock){+.+.}-{3:3}:
[  124.988750]        lock_acquire+0xd0/0x300
[  124.990405]        __mutex_lock+0x97/0x950
[  124.992091]        blkdev_show+0x18/0x90
[  124.993674]        devinfo_show+0x58/0x70
[  124.995404]        seq_read_iter+0x27b/0x3f0
[  124.997072]        proc_reg_read_iter+0x3c/0x60
[  124.998719]        new_sync_read+0x110/0x190
[  125.000428]        vfs_read+0x11d/0x1b0
[  125.001867]        ksys_read+0x63/0xe0
[  125.003429]        do_syscall_64+0x35/0xb0
[  125.005099]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  125.007527]
               -> #3 (&p->lock){+.+.}-{3:3}:
[  125.009796]        lock_acquire+0xd0/0x300
[  125.011675]        __mutex_lock+0x97/0x950
[  125.013469]        seq_read_iter+0x4c/0x3f0
[  125.015295]        generic_file_splice_read+0xf7/0x1a0
[  125.017381]        splice_direct_to_actor+0xc0/0x230
[  125.019268]        do_splice_direct+0x8c/0xd0
[  125.021085]        do_sendfile+0x319/0x5a0
[  125.022737]        __x64_sys_sendfile64+0xad/0xc0
[  125.024571]        do_syscall_64+0x35/0xb0
[  125.026347]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  125.028731]
               -> #2 (sb_writers#3){.+.+}-{0:0}:
[  125.031219]        lock_acquire+0xd0/0x300
[  125.032799]        lo_write_bvec+0xe1/0x260
[  125.034456]        loop_process_work+0x3e5/0xcf0
[  125.036629]        process_one_work+0x2aa/0x600
[  125.038524]        worker_thread+0x48/0x3d0
[  125.040279]        kthread+0x13e/0x170
[  125.041824]        ret_from_fork+0x1f/0x30
[  125.043639]
               -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
[  125.047124]        lock_acquire+0xd0/0x300
[  125.048763]        process_one_work+0x27e/0x600
[  125.050544]        worker_thread+0x48/0x3d0
[  125.052206]        kthread+0x13e/0x170
[  125.053703]        ret_from_fork+0x1f/0x30
[  125.055324]
               -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
[  125.057947]        check_prev_add+0x91/0xc00
[  125.059585]        __lock_acquire+0x14a8/0x1f40
[  125.061433]        lock_acquire+0xd0/0x300
[  125.063203]        flush_workqueue+0xa9/0x560
[  125.064891]        drain_workqueue+0x9b/0x100
[  125.066638]        destroy_workqueue+0x2f/0x210
[  125.068305]        __loop_clr_fd+0xa9/0x5b0
[  125.070086]        blkdev_put+0xaf/0x180
[  125.071606]        blkdev_close+0x20/0x30
[  125.073522]        __fput+0xa0/0x240
[  125.074924]        task_work_run+0x57/0xa0
[  125.076564]        exit_to_user_mode_prepare+0x252/0x260
[  125.078666]        syscall_exit_to_user_mode+0x19/0x60
[  125.080784]        do_syscall_64+0x42/0xb0
[  125.082360]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  125.084403]
               other info that might help us debug this:

[  125.087757] Chain exists of:
                 (wq_completion)loop0 --> loop_ctl_mutex --> &disk->open_mutex

[  125.092256]  Possible unsafe locking scenario:

[  125.094994]        CPU0                    CPU1
[  125.096963]        ----                    ----
[  125.098729]   lock(&disk->open_mutex);
[  125.100224]                                lock(loop_ctl_mutex);
[  125.102521]                                lock(&disk->open_mutex);
[  125.104862]   lock((wq_completion)loop0);
[  125.106511]
                *** DEADLOCK ***

[  125.109224] 1 lock held by systemd-udevd/6710:
[  125.111072]  #0: ffff88800dc87128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x180
[  125.114270]
               stack backtrace:
[  125.116193] CPU: 3 PID: 6710 Comm: systemd-udevd Not tainted 5.14.0-rc6+ #744
[  125.118706] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[  125.122536] Call Trace:
[  125.123604]  dump_stack_lvl+0x57/0x7d
[  125.125064]  check_noncircular+0x114/0x130
[  125.126621]  check_prev_add+0x91/0xc00
[  125.128264]  ? __lock_acquire+0x5c1/0x1f40
[  125.129953]  __lock_acquire+0x14a8/0x1f40
[  125.131544]  lock_acquire+0xd0/0x300
[  125.132936]  ? flush_workqueue+0x85/0x560
[  125.134500]  ? lockdep_init_map_type+0x51/0x220
[  125.136299]  ? lockdep_init_map_type+0x51/0x220
[  125.138086]  flush_workqueue+0xa9/0x560
[  125.139761]  ? flush_workqueue+0x85/0x560
[  125.141318]  ? drain_workqueue+0x9b/0x100
[  125.142822]  drain_workqueue+0x9b/0x100
[  125.144486]  destroy_workqueue+0x2f/0x210
[  125.146190]  __loop_clr_fd+0xa9/0x5b0
[  125.147655]  ? __mutex_unlock_slowpath+0x40/0x2a0
[  125.149516]  blkdev_put+0xaf/0x180
[  125.151042]  blkdev_close+0x20/0x30
[  125.152548]  __fput+0xa0/0x240
[  125.153865]  task_work_run+0x57/0xa0
[  125.155394]  exit_to_user_mode_prepare+0x252/0x260
[  125.157248]  syscall_exit_to_user_mode+0x19/0x60
[  125.158997]  do_syscall_64+0x42/0xb0
[  125.160510]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  125.162554] RIP: 0033:0x7eff14494987
[  125.164075] Code: ff ff e8 9c 11 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 c3 5d f8 ff
[  125.170670] RSP: 002b:00007ffde2ab8e58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  125.173563] RAX: 0000000000000000 RBX: 00007eff13f1c788 RCX: 00007eff14494987
[  125.176138] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
[  125.178858] RBP: 0000000000000006 R08: 0000562c4155aaf0 R09: 0000000000000000
[  125.181641] R10: 00007eff13f1c788 R11: 0000000000000246 R12: 0000000000000000
[  125.184197] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562c41542635
[  125.187428] loop0: detected capacity change from 0 to 4096
----------------------------------------

I was expecting this result from the beginning.

Before your patch, lockdep thinks that there are

  loop_ctl_mutex => &lo->lo_mutex => (wq_completion)loop$M =>
  (work_completion)&lo->rootcg_work => sb_writers#$N => &p->lock =>
  major_names_lock => loop_ctl_mutex

dependency chain and

  &disk->open_mutex => &lo->lo_mutex => (wq_completion)loop$M =>
  (work_completion)&lo->rootcg_work => sb_writers#$N => &p->lock =>
  major_names_lock => loop_ctl_mutex

dependency chain (as explained in my patch). Since your patch attempts to
take (wq_completion)loop$M without &lo->lo_mutex, the dependency chain
after your patch will become

  &disk->open_mutex => (wq_completion)loop$M =>
  (work_completion)&lo->rootcg_work => sb_writers#$N => &p->lock =>
  major_names_lock => loop_ctl_mutex

but after all

  loop_ctl_mutex => &disk->open_mutex

dependency is appended when loop_control_remove() from
loop_control_ioctl(LOOP_CTL_REMOVE) is called...

Not only it is unclear what lo->lo_mutex lock is protecting, but also it is
unclear and erroneous what loop_ctl_mutex lock is protecting. Nobody noticed
lack of loop_ctl_mutex protection in the following patch.

On 2021/08/15 15:52, Tetsuo Handa wrote:
> There would be two approaches for breaking this circular dependency.
> One is to kill loop_ctl_mutex => &lo->lo_mutex chain. The other is to
> kill major_names_lock => loop_ctl_mutex chain. This patch implements
> the latter, due to the following reasons.

F.Y.I. Below is a patch for implementing the former.
This is less suitable for backport, for this is more difficult to review.

 From 1b95e875071961b22420f69ac6c5a8ae9aa11638 Mon Sep 17 00:00:00 2001From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 14 Aug 2021 14:21:36 +0900
Subject: [PATCH] loop: break loop_ctl_mutex into loop_idr_spinlock and loop_removal_mutex

syzbot is reporting circular locking problem at __loop_clr_fd() [1], for
commit a160c6159d4a0cf8 ("block: add an optional probe callback to
major_names") is calling the module's probe function with major_names_lock
held.

Fortunately, since commit 990e78116d38059c ("block: loop: fix deadlock
between open and remove") stopped holding loop_ctl_mutex in lo_open(),
current role of loop_ctl_mutex is to serialize access to loop_index_idr
and loop_add()/loop_remove(); in other words, management of id for IDR.

Since IDR allows idr_replace(), we can reserve id for not-yet-initialized
and to-be-released loop devices, by assigning a dummy pointer to that
reserved id. By using idr_replace(), we can replace loop_ctl_mutex with
loop_idr_spinlock for protecting access to loop_index_idr.

By introducing loop_idr_spinlock, remaining role of loop_ctl_mutex can be
reduced to serialization of loop_control_remove(). Thus, let's rename
loop_ctl_mutex to loop_removal_mutex. Note that removing serialization
between loop_add() and loop_remove() has a side effect

  Thread A:             Thread B:

  loop_remove(1) {
    Lock loop_removal_mutex.
    lo = idr_find(1);
    idr_replace(dummy, 1);
    Unlock loop_removal_mutex.
                        loop_remove(1) {
                          Lock loop_removal_mutex.
                          lo = idr_find(1);
                          Unlock loop_removal_mutex.
                          Fails with -ENODEV due to lo == dummy.
                        }
                        loop_add(1) {
                          Fails with -EEXIST due to idr_alloc(1) failure.
                        }
    loop_remove(lo) {
      idr_remove(1);
    }
    return 0;
  }

which we could argue such sequence as a caller's bug. /dev/loop-control
users are expected to serialize ioctl() calls when passing non-negative
same id value. As long as ioctl() calls with non-negative same id value
are serialized, this approach can minimize serialization by
loop_removal_mutex.

In loop_control_remove(), we might expect that we can get rid of
loop_removal_mutex if we temporarily hide this lo (using idr_replace())
before holding lo_mutex and show this lo again (using idr_replace()) if
loop_remove() cannot be called. But we can't get rid of loop_removal_mutex
in order to close a use-after-free race window explained below.

I found that loop_unregister_transfer() which is called from
cleanup_cryptoloop() lacks serialization between kfree() from
loop_remove() from loop_control_remove() and mutex_lock() from
unregister_transfer_cb().

Both loop_control_remove() and loop_unregister_transfer() hold lo_mutex
of lo found from loop_index_idr, but loop_unregister_transfer() must
synchronously check all lo found from loop_index_idr (and call
loop_release_xfer() as needed) due to being called by module unloading
operation. Temporarily hiding lo on loop_control_remove() side can result
in failing to call loop_release_xfer() from unregister_transfer_cb() from
loop_unregister_transfer().

Given that cryptoloop is not safe for journaled file systems, I wonder
how many cryptoloop users are there. We could consider getting rid of
loop_unregister_transfer() by removing cleanup_cryptoloop() (i.e. make
cryptoloop no longer unloadable) which is the only in-tree caller of
loop_unregister_transfer() function.

For now, we need to hold loop_removal_mutex in loop_unregister_transfer().
In contrast, holding loop_removal_mutex in loop_exit() is pointless, for
all users must close /dev/loop-control and /dev/loop$num (in order to drop
module's refcount to 0) before loop_exit() starts, and nobody can open
/dev/loop-control or /dev/loop$num afterwards.

Link: https://syzkaller.appspot.com/bug?id=7bb10e8b62f83e4d445cdf4c13d69e407e629558 [1]
Reported-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com>
---

 drivers/block/loop.c | 130 +++++++++++++++++++++++++++++--------------
 1 file changed, 89 insertions(+), 41 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f0cdff0c5fbf..424d90f978d5 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -86,8 +86,15 @@
 
 #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
 
+/*
+ * Used for avoiding -EEXIST error at bdi_register() which happens when id is reused
+ * before bdi_unregister() completes, by preserving specific id for loop_index_idr.
+ */
+#define HIDDEN_LOOP_DEVICE ((struct loop_device *) -1)
+
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_ctl_mutex);
+static DEFINE_SPINLOCK(loop_idr_spinlock);
+static DEFINE_MUTEX(loop_removal_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
 
 /**
@@ -2113,28 +2120,37 @@ int loop_register_transfer(struct loop_func_table *funcs)
 	return 0;
 }
 
-static int unregister_transfer_cb(int id, void *ptr, void *data)
-{
-	struct loop_device *lo = ptr;
-	struct loop_func_table *xfer = data;
-
-	mutex_lock(&lo->lo_mutex);
-	if (lo->lo_encryption == xfer)
-		loop_release_xfer(lo);
-	mutex_unlock(&lo->lo_mutex);
-	return 0;
-}
-
 int loop_unregister_transfer(int number)
 {
 	unsigned int n = number;
 	struct loop_func_table *xfer;
+	struct loop_device *lo;
+	int id;
 
 	if (n == 0 || n >= MAX_LO_CRYPT || (xfer = xfer_funcs[n]) == NULL)
 		return -EINVAL;
 
 	xfer_funcs[n] = NULL;
-	idr_for_each(&loop_index_idr, &unregister_transfer_cb, xfer);
+
+	/*
+	 * Use loop_removal_mutex in order to make sure that
+	 * loop_control_remove() won't call loop_remove().
+	 */
+	mutex_lock(&loop_removal_mutex);
+	spin_lock(&loop_idr_spinlock);
+	idr_for_each_entry(&loop_index_idr, lo, id) {
+		if (lo == HIDDEN_LOOP_DEVICE)
+			continue;
+		spin_unlock(&loop_idr_spinlock);
+		mutex_lock(&lo->lo_mutex);
+		if (lo->lo_encryption == xfer)
+			loop_release_xfer(lo);
+		mutex_unlock(&lo->lo_mutex);
+		spin_lock(&loop_idr_spinlock);
+	}
+	spin_unlock(&loop_idr_spinlock);
+	mutex_unlock(&loop_removal_mutex);
+
 	return 0;
 }
 
@@ -2313,20 +2329,21 @@ static int loop_add(int i)
 		goto out;
 	lo->lo_state = Lo_unbound;
 
-	err = mutex_lock_killable(&loop_ctl_mutex);
-	if (err)
-		goto out_free_dev;
-
 	/* allocate id, if @id >= 0, we're requesting that specific id */
+	idr_preload(GFP_KERNEL);
+	/* Reserve id for this loop device, but keep this loop device hidden. */
+	spin_lock(&loop_idr_spinlock);
 	if (i >= 0) {
-		err = idr_alloc(&loop_index_idr, lo, i, i + 1, GFP_KERNEL);
+		err = idr_alloc(&loop_index_idr, HIDDEN_LOOP_DEVICE, i, i + 1, GFP_ATOMIC);
 		if (err == -ENOSPC)
 			err = -EEXIST;
 	} else {
-		err = idr_alloc(&loop_index_idr, lo, 0, 0, GFP_KERNEL);
+		err = idr_alloc(&loop_index_idr, HIDDEN_LOOP_DEVICE, 0, 0, GFP_ATOMIC);
 	}
+	spin_unlock(&loop_idr_spinlock);
+	idr_preload_end();
 	if (err < 0)
-		goto out_unlock;
+		goto out_free_dev;
 	i = err;
 
 	err = -ENOMEM;
@@ -2392,16 +2409,21 @@ static int loop_add(int i)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+	/* Make this loop device reachable from pathname. */
 	add_disk(disk);
-	mutex_unlock(&loop_ctl_mutex);
+	/* Show this loop device. */
+	spin_lock(&loop_idr_spinlock);
+	idr_replace(&loop_index_idr, lo, i);
+	spin_unlock(&loop_idr_spinlock);
 	return i;
 
 out_cleanup_tags:
 	blk_mq_free_tag_set(&lo->tag_set);
 out_free_idr:
+	/* Release id reserved for lo->lo_number. */
+	spin_lock(&loop_idr_spinlock);
 	idr_remove(&loop_index_idr, i);
-out_unlock:
-	mutex_unlock(&loop_ctl_mutex);
+	spin_unlock(&loop_idr_spinlock);
 out_free_dev:
 	kfree(lo);
 out:
@@ -2410,9 +2432,15 @@ static int loop_add(int i)
 
 static void loop_remove(struct loop_device *lo)
 {
+	/* Make this loop device unreachable from pathname. */
 	del_gendisk(lo->lo_disk);
 	blk_cleanup_disk(lo->lo_disk);
 	blk_mq_free_tag_set(&lo->tag_set);
+	/* Release id used by lo->lo_number. */
+	spin_lock(&loop_idr_spinlock);
+	idr_remove(&loop_index_idr, lo->lo_number);
+	spin_unlock(&loop_idr_spinlock);
+	/* There is no route which can find this loop device. */
 	mutex_destroy(&lo->lo_mutex);
 	kfree(lo);
 }
@@ -2435,52 +2463,68 @@ static int loop_control_remove(int idx)
 		pr_warn("deleting an unspecified loop device is not supported.\n");
 		return -EINVAL;
 	}
-		
-	ret = mutex_lock_killable(&loop_ctl_mutex);
+
+	/* Serialize concurrent loop_control_remove() and loop_unregister_transfer(). */
+	ret = mutex_lock_killable(&loop_removal_mutex);
 	if (ret)
 		return ret;
 
+	/*
+	 * Identify the loop device to remove. Skip the device if it is owned by
+	 * loop_remove()/loop_add() where it is not safe to access lo_mutex.
+	 */
+	spin_lock(&loop_idr_spinlock);
 	lo = idr_find(&loop_index_idr, idx);
-	if (!lo) {
+	if (!lo || lo == HIDDEN_LOOP_DEVICE) {
+		spin_unlock(&loop_idr_spinlock);
 		ret = -ENODEV;
-		goto out_unlock_ctrl;
+		goto out_unlock;
 	}
+	spin_unlock(&loop_idr_spinlock);
 
 	ret = mutex_lock_killable(&lo->lo_mutex);
 	if (ret)
-		goto out_unlock_ctrl;
+		goto out_unlock;
 	if (lo->lo_state != Lo_unbound ||
 	    atomic_read(&lo->lo_refcnt) > 0) {
 		mutex_unlock(&lo->lo_mutex);
 		ret = -EBUSY;
-		goto out_unlock_ctrl;
+		goto out_unlock;
 	}
+	/* Mark this loop device no longer open()-able. */
 	lo->lo_state = Lo_deleting;
 	mutex_unlock(&lo->lo_mutex);
 
-	idr_remove(&loop_index_idr, lo->lo_number);
+	/* Hide this loop device, but keep lo->lo_number still held. */
+	spin_lock(&loop_idr_spinlock);
+	idr_replace(&loop_index_idr, HIDDEN_LOOP_DEVICE, lo->lo_number);
+	spin_unlock(&loop_idr_spinlock);
+	/* Allow loop_control_remove() and loop_unregister_transfer() to resume. */
+	mutex_unlock(&loop_removal_mutex);
+	/* Remove this loop device. */
 	loop_remove(lo);
-out_unlock_ctrl:
-	mutex_unlock(&loop_ctl_mutex);
+	return 0;
+out_unlock:
+	mutex_unlock(&loop_removal_mutex);
 	return ret;
 }
 
 static int loop_control_get_free(int idx)
 {
 	struct loop_device *lo;
-	int id, ret;
+	int id;
 
-	ret = mutex_lock_killable(&loop_ctl_mutex);
-	if (ret)
-		return ret;
+	spin_lock(&loop_idr_spinlock);
 	idr_for_each_entry(&loop_index_idr, lo, id) {
+		if (lo == HIDDEN_LOOP_DEVICE)
+			continue;
 		if (lo->lo_state == Lo_unbound)
 			goto found;
 	}
-	mutex_unlock(&loop_ctl_mutex);
+	spin_unlock(&loop_idr_spinlock);
 	return loop_add(-1);
 found:
-	mutex_unlock(&loop_ctl_mutex);
+	spin_unlock(&loop_idr_spinlock);
 	return id;
 }
 
@@ -2590,10 +2634,14 @@ static void __exit loop_exit(void)
 	unregister_blkdev(LOOP_MAJOR, "loop");
 	misc_deregister(&loop_misc);
 
-	mutex_lock(&loop_ctl_mutex);
+	/*
+	 * There can't be hidden loop device on loop_index_idr, for loop_add()/loop_remove()
+	 * can't be in progress when this module is unloading. Also, there is no need to use
+	 * loop_idr_spinlock here, for nobody else can access loop_index_idr when this module
+	 * is unloading.
+	 */
 	idr_for_each_entry(&loop_index_idr, lo, id)
 		loop_remove(lo);
-	mutex_unlock(&loop_ctl_mutex);
 
 	idr_destroy(&loop_index_idr);
 }
Tetsuo Handa Aug. 17, 2021, 10:18 a.m. UTC | #6
On 2021/08/17 17:10, Hillf Danton wrote:
> See if it is safe to kfree(lo) after removing it from idr, with the
> deadlock dissolved.

It is not safe to call loop_remove() after idr_remove(). Please see HIDDEN_LOOP_DEVICE magic
in "[PATCH] loop: break loop_ctl_mutex into loop_idr_spinlock and loop_removal_mutex".

> 
> --- x/drivers/block/loop.c
> +++ y/drivers/block/loop.c
> @@ -2459,7 +2459,9 @@ static int loop_control_remove(int idx)
>  	mutex_unlock(&lo->lo_mutex);
>  
>  	idr_remove(&loop_index_idr, lo->lo_number);
> +	mutex_unlock(&loop_ctl_mutex);
>  	loop_remove(lo);
> +	return 0;
>  out_unlock_ctrl:
>  	mutex_unlock(&loop_ctl_mutex);
>  	return ret;
> --
> 

"[PATCH] loop: break loop_ctl_mutex into loop_idr_spinlock and loop_removal_mutex" can be a further
improvement after "[PATCH v3] block: genhd: don't call probe function with major_names_lock held".

I really would like to apply "[PATCH v3] block: genhd: don't call probe function with major_names_lock held" first.
diff mbox series

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 298ee78c1bda..e7c75c5aa831 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -162,6 +162,7 @@  static struct blk_major_name {
 	int major;
 	char name[16];
 	void (*probe)(dev_t devt);
+	struct module *owner;
 } *major_names[BLKDEV_MAJOR_HASH_SIZE];
 static DEFINE_MUTEX(major_names_lock);
 
@@ -190,7 +191,8 @@  void blkdev_show(struct seq_file *seqf, off_t offset)
  * @major: the requested major device number [1..BLKDEV_MAJOR_MAX-1]. If
  *         @major = 0, try to allocate any unused major number.
  * @name: the name of the new block device as a zero terminated string
- * @probe: allback that is called on access to any minor number of @major
+ * @probe: callback that is called on access to any minor number of @major
+ * @owner: THIS_MODULE if @probe is not NULL, ignored if @probe is NULL.
  *
  * The @name must be unique within the system.
  *
@@ -208,7 +210,7 @@  void blkdev_show(struct seq_file *seqf, off_t offset)
  * Use register_blkdev instead for any new code.
  */
 int __register_blkdev(unsigned int major, const char *name,
-		void (*probe)(dev_t devt))
+		      void (*probe)(dev_t devt), struct module *owner)
 {
 	struct blk_major_name **n, *p;
 	int index, ret = 0;
@@ -248,6 +250,7 @@  int __register_blkdev(unsigned int major, const char *name,
 
 	p->major = major;
 	p->probe = probe;
+	p->owner = owner;
 	strlcpy(p->name, name, sizeof(p->name));
 	p->next = NULL;
 	index = major_to_index(major);
@@ -653,14 +656,29 @@  void blk_request_module(dev_t devt)
 {
 	unsigned int major = MAJOR(devt);
 	struct blk_major_name **n;
+	void (*probe_fn)(dev_t devt);
 
 	mutex_lock(&major_names_lock);
 	for (n = &major_names[major_to_index(major)]; *n; n = &(*n)->next) {
-		if ((*n)->major == major && (*n)->probe) {
-			(*n)->probe(devt);
-			mutex_unlock(&major_names_lock);
-			return;
-		}
+		if ((*n)->major != major || !(*n)->probe)
+			continue;
+		if (!try_module_get((*n)->owner))
+			break;
+		/*
+		 * Calling probe function with major_names_lock held causes
+		 * circular locking dependency problem. Thus, call it after
+		 * releasing major_names_lock.
+		 */
+		probe_fn = (*n)->probe;
+		mutex_unlock(&major_names_lock);
+		/*
+		 * Assuming that unregister_blkdev() is called from module's
+		 * __exit function, a module refcount taken above allows us
+		 * to safely call probe function without major_names_lock held.
+		 */
+		probe_fn(devt);
+		module_put((*n)->owner);
+		return;
 	}
 	mutex_unlock(&major_names_lock);
 
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index a093644ac39f..1b7fe10d49e7 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -2016,7 +2016,7 @@  static int __init atari_floppy_init (void)
 		return -ENODEV;
 
 	mutex_lock(&ataflop_probe_lock);
-	ret = __register_blkdev(FLOPPY_MAJOR, "fd", ataflop_probe);
+	ret = __register_blkdev(FLOPPY_MAJOR, "fd", ataflop_probe, THIS_MODULE);
 	if (ret)
 		goto out_unlock;
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 95694113e38e..d0bdfd56dfc8 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -487,7 +487,7 @@  static int __init brd_init(void)
 	 *	dynamically.
 	 */
 
-	if (__register_blkdev(RAMDISK_MAJOR, "ramdisk", brd_probe))
+	if (__register_blkdev(RAMDISK_MAJOR, "ramdisk", brd_probe, THIS_MODULE))
 		return -EIO;
 
 	brd_check_and_reset_par();
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 87460e0e5c72..ee33ba03e6bd 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4570,7 +4570,7 @@  static int __init do_floppy_init(void)
 		timer_setup(&motor_off_timer[drive], motor_off_callback, 0);
 	}
 
-	err = __register_blkdev(FLOPPY_MAJOR, "fd", floppy_probe);
+	err = __register_blkdev(FLOPPY_MAJOR, "fd", floppy_probe, THIS_MODULE);
 	if (err)
 		goto out_put_disk;
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f0cdff0c5fbf..d6606c3b7d74 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2564,7 +2564,7 @@  static int __init loop_init(void)
 		goto err_out;
 
 
-	if (__register_blkdev(LOOP_MAJOR, "loop", loop_probe)) {
+	if (__register_blkdev(LOOP_MAJOR, "loop", loop_probe, THIS_MODULE)) {
 		err = -EIO;
 		goto misc_out;
 	}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ae8fe54ea358..c13f45c0f502 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9584,11 +9584,11 @@  static int __init md_init(void)
 	if (!md_rdev_misc_wq)
 		goto err_rdev_misc_wq;
 
-	ret = __register_blkdev(MD_MAJOR, "md", md_probe);
+	ret = __register_blkdev(MD_MAJOR, "md", md_probe, THIS_MODULE);
 	if (ret < 0)
 		goto err_md;
 
-	ret = __register_blkdev(0, "mdp", md_probe);
+	ret = __register_blkdev(0, "mdp", md_probe, THIS_MODULE);
 	if (ret < 0)
 		goto err_mdp;
 	mdp_major = ret;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b8d55af763f9..ddd67a1045e7 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3729,7 +3729,7 @@  static int __init init_sd(void)
 	SCSI_LOG_HLQUEUE(3, printk("init_sd: sd driver entry point\n"));
 
 	for (i = 0; i < SD_MAJORS; i++) {
-		if (__register_blkdev(sd_major(i), "sd", sd_default_probe))
+		if (__register_blkdev(sd_major(i), "sd", sd_default_probe, THIS_MODULE))
 			continue;
 		majors++;
 	}
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 13b34177cc85..b0948003071d 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -303,9 +303,9 @@  struct gendisk *__blk_alloc_disk(int node);
 void blk_cleanup_disk(struct gendisk *disk);
 
 int __register_blkdev(unsigned int major, const char *name,
-		void (*probe)(dev_t devt));
+		      void (*probe)(dev_t devt), struct module *owner);
 #define register_blkdev(major, name) \
-	__register_blkdev(major, name, NULL)
+	__register_blkdev(major, name, NULL, NULL)
 void unregister_blkdev(unsigned int major, const char *name);
 
 bool bdev_check_media_change(struct block_device *bdev);