scsi: fix race condition when removing target

In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
removed scsi_device_get() and directly called get_device() to increase
the refcount of the device. But actually scsi_device_get() will fail in
three cases:
1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
2. get_device() fails
3. the module is not alive
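
As a rough illustration of the three cases above, here is a simplified
userspace model. The struct, its fields, and model_device_get() are
hypothetical stand-ins, not the kernel's actual code; the booleans merely
simulate whether get_device() and the module reference would succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the device states relevant here. */
enum model_state { MODEL_RUNNING, MODEL_CANCEL, MODEL_DEL };

struct model_sdev {
    enum model_state state;
    bool device_ref_ok;  /* stands in for get_device() succeeding */
    bool module_alive;   /* stands in for the module reference succeeding */
    int refs;            /* simplified reference count */
};

/* Returns 0 on success, -ENXIO in each of the three failure cases. */
static int model_device_get(struct model_sdev *sdev)
{
    assert(sdev != NULL);

    /* Case 1: device is already being torn down. */
    if (sdev->state == MODEL_DEL || sdev->state == MODEL_CANCEL)
        return -ENXIO;

    /* Case 2: taking the device reference fails. */
    if (!sdev->device_ref_ok)
        return -ENXIO;
    sdev->refs++;

    /* Case 3: module is not alive; undo the device reference. */
    if (!sdev->module_alive) {
        sdev->refs--;
        return -ENXIO;
    }
    return 0;
}
```

In this model, dropping case 3 by calling only the equivalent of
get_device() also drops case 1, which is the regression described below.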

The intent was only to remove the module-alive check. Unfortunately the
device state check was dropped too, which introduced a race condition
like this:

      CPU0                                           CPU1
__scsi_remove_target()
  ->iterate shost->__devices
  ->scsi_remove_device()
  ->put_device()
      someone still holds a refcount
                                                   sd_release()
                                                      ->scsi_disk_put()
                                                      ->put_device() last put and trigger the device release

  ->goto restart
  ->iterate shost->__devices and got the same device
  ->get_device() while refcount is 0
  ->scsi_remove_device()
  ->put_device() refcount decreased to 0 again
  ->scsi_device_dev_release()
  ->scsi_device_dev_release_usercontext()

                                                      ->scsi_device_dev_release()
                                                      ->scsi_device_dev_release_usercontext()

The same scsi device will be found again because it stays on the
shost->__devices list until scsi_device_dev_release_usercontext() is
called, although the device state was set to SDEV_DEL after the first
scsi_remove_device().

Finally we get an oops in scsi_device_dev_release_usercontext() when it is
called for the second time.

Call trace:
[<ffff0000086bc624>] scsi_device_dev_release_usercontext+0x7c/0x1c0
[<ffff0000080f1f90>] execute_in_process_context+0x70/0x80
[<ffff0000086bc598>] scsi_device_dev_release+0x28/0x38
[<ffff0000086662cc>] device_release+0x3c/0xa0
[<ffff000008c2e780>] kobject_put+0x80/0xf0
[<ffff0000086666fc>] put_device+0x24/0x30
[<ffff0000086aeee0>] scsi_device_put+0x30/0x40
[<ffff000008704894>] scsi_disk_put+0x44/0x60
[<ffff000008704a50>] sd_release+0x50/0x80
[<ffff0000082bc704>] __blkdev_put+0x21c/0x230
[<ffff0000082bcb2c>] blkdev_put+0x54/0x118
[<ffff0000082bcc1c>] blkdev_close+0x2c/0x40
[<ffff000008279b64>] __fput+0x94/0x1d8
[<ffff000008279d20>] ____fput+0x20/0x30
[<ffff0000080f6f54>] task_work_run+0x9c/0xb8
[<ffff0000080dba64>] do_exit+0x2b4/0x9f8
[<ffff0000080dc234>] do_group_exit+0x3c/0xa0
[<ffff0000080dc2b8>] __wake_up_parent+0x0/0x40

In addition, __scsi_remove_target() can sometimes loop for a long time,
repeatedly removing the same device while someone else holds a refcount,
until the last refcount is released.

Note that if CONFIG_REFCOUNT_FULL is enabled this race won't be triggered,
because the full refcount implementation prevents the refcount from being
increased when it is 0.
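
The difference can be sketched with a small userspace model using C11
atomics. The names ref_model, ref_model_inc_not_zero() and
ref_model_inc_unchecked() are hypothetical; this is not the kernel's
refcount code, only an illustration of an increment that refuses to
resurrect a zero count versus one that does not check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference counter for illustration only. */
struct ref_model {
    atomic_uint count;
};

/* Increment only if the count is non-zero; returns false at zero. */
static bool ref_model_inc_not_zero(struct ref_model *r)
{
    assert(r != NULL);

    unsigned int old = atomic_load(&r->count);
    do {
        if (old == 0)
            return false;  /* object is already on its way out */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/* Unchecked increment: happily turns 0 into 1, enabling a second release. */
static void ref_model_inc_unchecked(struct ref_model *r)
{
    assert(r != NULL);
    atomic_fetch_add(&r->count, 1);
}
```

With the unchecked variant, a count that already dropped to 0 goes back to
1, and the following put drops it to 0 a second time, which corresponds to
the double release shown in the diagram above.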

Fix this by checking the sdev_state again like we did before in
scsi_device_get(). Then when iterating shost again we will skip the
deleted device, because scsi_remove_device() will set the device state to
SDEV_CANCEL or SDEV_DEL.
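
A minimal userspace sketch of the idea, assuming a plain array in place of
the shost->__devices list. The types and the next_removable() helper are
hypothetical and only model the restored state check; they are not the
actual change to drivers/scsi/scsi_sysfs.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the device states relevant here. */
enum sdev_state_model { MODEL_SDEV_RUNNING, MODEL_SDEV_CANCEL, MODEL_SDEV_DEL };

struct sdev_model {
    enum sdev_state_model state;
    int channel;
    int id;
};

/*
 * Return the index of the next device on (channel, id) that still needs
 * to be removed, or -1 if there is none.
 */
static int next_removable(const struct sdev_model *devs, size_t n,
                          int channel, int id)
{
    assert(devs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (devs[i].channel != channel || devs[i].id != id)
            continue;
        /* Skip devices that a previous pass already removed. */
        if (devs[i].state == MODEL_SDEV_DEL ||
            devs[i].state == MODEL_SDEV_CANCEL)
            continue;
        return (int)i;
    }
    return -1;
}
```

Because a removed device is skipped on the next pass, the restart loop
neither takes a reference on it again nor spins on it while another holder
keeps it alive.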

Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
CC: Hannes Reinecke <hare@suse.de>
CC: Christoph Hellwig <hch@lst.de>
CC: Johannes Thumshirn <jthumshirn@suse.de>
CC: Zhaohongjiang <zhaohongjiang@huawei.com>
CC: Miao Xie <miaoxie@huawei.com>
---
 drivers/scsi/scsi_sysfs.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
