scsi: fix race condition when removing target

Message ID 20171129030556.47833-1-yanaijie@huawei.com (mailing list archive)
State Changes Requested

Commit Message

Jason Yan Nov. 29, 2017, 3:05 a.m. UTC
In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
removed scsi_device_get() and directly called get_device() to increase
the refcount of the device. But actually scsi_device_get() will fail in
three cases:
1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
2. get_device() fails
3. the module is not alive

The intended purpose was to remove the module liveness check.
Unfortunately the device state check was dropped too, and this
introduced a race condition like this:

      CPU0                                           CPU1
__scsi_remove_target()
  ->iterate shost->__devices
  ->scsi_remove_device()
  ->put_device()
      someone still holds a refcount
                                                   sd_release()
                                                      ->scsi_disk_put()
                                                      ->put_device() last put and trigger the device release

  ->goto restart
  ->iterate shost->__devices and got the same device
  ->get_device() while refcount is 0
  ->scsi_remove_device()
  ->put_device() refcount decreased to 0 again
  ->scsi_device_dev_release()
  ->scsi_device_dev_release_usercontext()

                                                      ->scsi_device_dev_release()
                                                      ->scsi_device_dev_release_usercontext()

The same scsi device will be found again because it stays in the shost->__devices
list until scsi_device_dev_release_usercontext() is called, although the device
state was set to SDEV_DEL after the first scsi_remove_device().

Finally we got an oops in scsi_device_dev_release_usercontext() when it was
called the second time.
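
The interleaving above can be replayed in a single thread. What follows is
only a minimal userspace sketch, not kernel code: struct model_dev,
model_get(), model_put() and replay_race() are hypothetical names, the counter
stands in for a kref without the full refcount checks, and the release
callback is reduced to a call counter.

```c
/* Hypothetical stand-in for a refcounted device; not the kernel's struct device. */
struct model_dev {
	int refcount;		/* plain counter, incremented unconditionally */
	int release_calls;	/* how many times the release callback has run */
};

/* Unconditional get: increments even if the count has already hit zero. */
static void model_get(struct model_dev *d)
{
	d->refcount++;
}

/* Put: runs the release callback every time the count drops to zero. */
static void model_put(struct model_dev *d)
{
	if (--d->refcount == 0)
		d->release_calls++;
}

/* Replays the CPU0/CPU1 interleaving from the diagram, step by step. */
static int replay_race(void)
{
	/* Assumption: one reference is held by an opener of the disk. */
	struct model_dev d = { .refcount = 1, .release_calls = 0 };

	model_get(&d);	/* CPU0, first pass: get the device found on the list */
	model_put(&d);	/* CPU0: put after removal, the opener still holds a ref */
	model_put(&d);	/* CPU1: last put, release runs for the first time */
	model_get(&d);	/* CPU0, after restart: get while the refcount is 0 */
	model_put(&d);	/* CPU0: refcount drops to 0 again, second release */

	return d.release_calls;
}
```

In this model replay_race() reports two release calls for one device, which
corresponds to the double call of scsi_device_dev_release_usercontext()
described above.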

Call trace:
[<ffff0000086bc624>] scsi_device_dev_release_usercontext+0x7c/0x1c0
[<ffff0000080f1f90>] execute_in_process_context+0x70/0x80
[<ffff0000086bc598>] scsi_device_dev_release+0x28/0x38
[<ffff0000086662cc>] device_release+0x3c/0xa0
[<ffff000008c2e780>] kobject_put+0x80/0xf0
[<ffff0000086666fc>] put_device+0x24/0x30
[<ffff0000086aeee0>] scsi_device_put+0x30/0x40
[<ffff000008704894>] scsi_disk_put+0x44/0x60
[<ffff000008704a50>] sd_release+0x50/0x80
[<ffff0000082bc704>] __blkdev_put+0x21c/0x230
[<ffff0000082bcb2c>] blkdev_put+0x54/0x118
[<ffff0000082bcc1c>] blkdev_close+0x2c/0x40
[<ffff000008279b64>] __fput+0x94/0x1d8
[<ffff000008279d20>] ____fput+0x20/0x30
[<ffff0000080f6f54>] task_work_run+0x9c/0xb8
[<ffff0000080dba64>] do_exit+0x2b4/0x9f8
[<ffff0000080dc234>] do_group_exit+0x3c/0xa0
[<ffff0000080dc2b8>] __wake_up_parent+0x0/0x40

And sometimes __scsi_remove_target() will loop for a long time
removing the same device if someone else is holding a refcount, until the
last refcount is released.

Notice that if CONFIG_REFCOUNT_FULL is enabled this race won't be triggered
because the full refcount implementation will prevent the refcount increase
when it is 0.

Fix this by checking the sdev_state again like we did before in
scsi_device_get(). Then when iterating shost again we will skip the deleted
device because scsi_remove_device() will set the device state to
SDEV_CANCEL or SDEV_DEL.
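
The effect of the state check can be sketched in userspace as follows. This is
an illustrative model under stated assumptions, not the patch itself: the
model_* names and replay_with_state_check() are hypothetical, and
model_remove() only records the state change that scsi_remove_device() makes.

```c
#include <errno.h>

/* Hypothetical subset of the device states named in the text. */
enum model_state { MODEL_RUNNING, MODEL_CANCEL, MODEL_DEL };

struct model_sdev {
	enum model_state state;
	int refcount;
	int release_calls;	/* how many times the release callback has run */
};

/* Take a reference only if the device is not already being deleted. */
static int model_get_not_deleted(struct model_sdev *s)
{
	if (s->state == MODEL_DEL || s->state == MODEL_CANCEL)
		return -ENXIO;
	s->refcount++;
	return 0;
}

static void model_put(struct model_sdev *s)
{
	if (--s->refcount == 0)
		s->release_calls++;
}

/* Stand-in for scsi_remove_device(): only the state change is modelled. */
static void model_remove(struct model_sdev *s)
{
	s->state = MODEL_DEL;
}

/* Same interleaving as in the race diagram, but with the state check. */
static int replay_with_state_check(void)
{
	/* Assumption: one reference is held by an opener of the disk. */
	struct model_sdev s = { MODEL_RUNNING, 1, 0 };

	if (model_get_not_deleted(&s) == 0) {	/* CPU0, first pass */
		model_remove(&s);
		model_put(&s);
	}
	model_put(&s);				/* CPU1: last put, single release */
	if (model_get_not_deleted(&s) == 0) {	/* CPU0, after restart: skipped */
		model_remove(&s);
		model_put(&s);
	}
	return s.release_calls;
}
```

With the check in place the second pass skips the device, so the model records
a single release.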

Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
CC: Hannes Reinecke <hare@suse.de>
CC: Christoph Hellwig <hch@lst.de>
CC: Johannes Thumshirn <jthumshirn@suse.de>
CC: Zhaohongjiang <zhaohongjiang@huawei.com>
CC: Miao Xie <miaoxie@huawei.com>
---
 drivers/scsi/scsi_sysfs.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Hannes Reinecke Nov. 29, 2017, 7:41 a.m. UTC | #1
On 11/29/2017 04:05 AM, Jason Yan wrote:
> In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
> removed scsi_device_get() and directly called get_device() to increase
> the refcount of the device. But actullay scsi_device_get() will fail in
> three cases:
> 1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
> 2. get_device() fail
> 3. the module is not alive
> 
> The intended purpose was to remove the check of the module alive.
> Unfortunately the check of the device state was droped too. And this
> introduced a race condition like this:
> 
>       CPU0                                           CPU1
> __scsi_remove_target()
>   ->iterate shost->__devices
>   ->scsi_remove_device()
>   ->put_device()
>       someone still hold a refcount
>                                                    sd_release()
>                                                       ->scsi_disk_put()
>                                                       ->put_device() last put and trigger the device release
> 
>   ->goto restart
>   ->iterate shost->__devices and got the same device
>   ->get_device() while refcount is 0
>   ->scsi_remove_device()
>   ->put_device() refcount decreased to 0 again
>   ->scsi_device_dev_release()
>   ->scsi_device_dev_release_usercontext()
> 
>                                                       ->scsi_device_dev_release()
>                                                       ->scsi_device_dev_release_usercontext()
> 
> The same scsi device will be found agian because it is in the shost->__devices
> list until scsi_device_dev_release_usercontext() called, although the device
> state was set to SDEV_DEL after the first scsi_remove_device().
> 
> Finally we got a oops in scsi_device_dev_release_usercontext() when the second
> time be called.
> 
> Call trace:
> [<ffff0000086bc624>] scsi_device_dev_release_usercontext+0x7c/0x1c0
> [<ffff0000080f1f90>] execute_in_process_context+0x70/0x80
> [<ffff0000086bc598>] scsi_device_dev_release+0x28/0x38
> [<ffff0000086662cc>] device_release+0x3c/0xa0
> [<ffff000008c2e780>] kobject_put+0x80/0xf0
> [<ffff0000086666fc>] put_device+0x24/0x30
> [<ffff0000086aeee0>] scsi_device_put+0x30/0x40
> [<ffff000008704894>] scsi_disk_put+0x44/0x60
> [<ffff000008704a50>] sd_release+0x50/0x80
> [<ffff0000082bc704>] __blkdev_put+0x21c/0x230
> [<ffff0000082bcb2c>] blkdev_put+0x54/0x118
> [<ffff0000082bcc1c>] blkdev_close+0x2c/0x40
> [<ffff000008279b64>] __fput+0x94/0x1d8
> [<ffff000008279d20>] ____fput+0x20/0x30
> [<ffff0000080f6f54>] task_work_run+0x9c/0xb8
> [<ffff0000080dba64>] do_exit+0x2b4/0x9f8
> [<ffff0000080dc234>] do_group_exit+0x3c/0xa0
> [<ffff0000080dc2b8>] __wake_up_parent+0x0/0x40
> 
> And sometimes in __scsi_remove_target() it will loop for a long time
> removing the same device if someone else holding a refcount until the
> last refcount is released.
> 
> Notice that if CONFIG_REFCOUNT_FULL is open this race won't be triggered
> because the full refcount implement will prevent the refcount increase
> when it is 0.
> 
> Fix this by checking the sdev_state again like we did before in
> scsi_device_get(). Then when iterating shost again we will skip the device
> deleted because scsi_remove_device() will set the device state to
> SDEV_CANCEL or SDEV_DEL.
> 
> Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
> Signed-off-by: Jason Yan <yanaijie@huawei.com>
> CC: Hannes Reinecke <hare@suse.de>
> CC: Christoph Hellwig <hch@lst.de>
> CC: Johannes Thumshirn <jthumshirn@suse.de>
> CC: Zhaohongjiang <zhaohongjiang@huawei.com>
> CC: Miao Xie <miaoxie@huawei.com>
> ---
>  drivers/scsi/scsi_sysfs.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 50e7d7e..d398894 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1398,6 +1398,15 @@ void scsi_remove_device(struct scsi_device *sdev)
>  }
>  EXPORT_SYMBOL(scsi_remove_device);
>  
> +static int scsi_device_get_not_deleted(struct scsi_device *sdev)
> +{
> +	if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
> +		return -ENXIO;
> +	if (!get_device(&sdev->sdev_gendev))
> +		return -ENXIO;
> +	return 0;
> +}
> +
>  static void __scsi_remove_target(struct scsi_target *starget)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> @@ -1415,7 +1424,7 @@ static void __scsi_remove_target(struct scsi_target *starget)
>  		 */
>  		if (sdev->channel != starget->channel ||
>  		    sdev->id != starget->id ||
> -		    !get_device(&sdev->sdev_gendev))
> +		    scsi_device_get_not_deleted(sdev))
>  			continue;
>  		spin_unlock_irqrestore(shost->host_lock, flags);
>  		scsi_remove_device(sdev);
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
Bart Van Assche Nov. 29, 2017, 4:18 p.m. UTC | #2
On Wed, 2017-11-29 at 11:05 +0800, Jason Yan wrote:
> In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
> removed scsi_device_get() and directly called get_device() to increase
> the refcount of the device. But actullay scsi_device_get() will fail in
> three cases:
> 1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
> 2. get_device() fail
> 3. the module is not alive
> 
> The intended purpose was to remove the check of the module alive.
> Unfortunately the check of the device state was droped too. And this
> introduced a race condition like this:
> 
>       CPU0                                           CPU1
> __scsi_remove_target()
>   ->iterate shost->__devices
>   ->scsi_remove_device()
>   ->put_device()
>       someone still hold a refcount
>                                                    sd_release()
>                                                       ->scsi_disk_put()
>                                                       ->put_device() last put and trigger the device release
> 
>   ->goto restart
>   ->iterate shost->__devices and got the same device
>   ->get_device() while refcount is 0
>   ->scsi_remove_device()
>   ->put_device() refcount decreased to 0 again
>   ->scsi_device_dev_release()
>   ->scsi_device_dev_release_usercontext()
> 
>                                                       ->scsi_device_dev_release()
>                                                       ->scsi_device_dev_release_usercontext()
> 
> The same scsi device will be found agian because it is in the shost->__devices
> list until scsi_device_dev_release_usercontext() called, although the device
> state was set to SDEV_DEL after the first scsi_remove_device().
> 
> Finally we got a oops in scsi_device_dev_release_usercontext() when the second
> time be called.
> 
> Call trace:
> [<ffff0000086bc624>] scsi_device_dev_release_usercontext+0x7c/0x1c0
> [<ffff0000080f1f90>] execute_in_process_context+0x70/0x80
> [<ffff0000086bc598>] scsi_device_dev_release+0x28/0x38
> [<ffff0000086662cc>] device_release+0x3c/0xa0
> [<ffff000008c2e780>] kobject_put+0x80/0xf0
> [<ffff0000086666fc>] put_device+0x24/0x30
> [<ffff0000086aeee0>] scsi_device_put+0x30/0x40
> [<ffff000008704894>] scsi_disk_put+0x44/0x60
> [<ffff000008704a50>] sd_release+0x50/0x80
> [<ffff0000082bc704>] __blkdev_put+0x21c/0x230
> [<ffff0000082bcb2c>] blkdev_put+0x54/0x118
> [<ffff0000082bcc1c>] blkdev_close+0x2c/0x40
> [<ffff000008279b64>] __fput+0x94/0x1d8
> [<ffff000008279d20>] ____fput+0x20/0x30
> [<ffff0000080f6f54>] task_work_run+0x9c/0xb8
> [<ffff0000080dba64>] do_exit+0x2b4/0x9f8
> [<ffff0000080dc234>] do_group_exit+0x3c/0xa0
> [<ffff0000080dc2b8>] __wake_up_parent+0x0/0x40
> 
> And sometimes in __scsi_remove_target() it will loop for a long time
> removing the same device if someone else holding a refcount until the
> last refcount is released.
> 
> Notice that if CONFIG_REFCOUNT_FULL is open this race won't be triggered
> because the full refcount implement will prevent the refcount increase
> when it is 0.
> 
> Fix this by checking the sdev_state again like we did before in
> scsi_device_get(). Then when iterating shost again we will skip the device
> deleted because scsi_remove_device() will set the device state to
> SDEV_CANCEL or SDEV_DEL.
> 
> Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
> Signed-off-by: Jason Yan <yanaijie@huawei.com>
> CC: Hannes Reinecke <hare@suse.de>
> CC: Christoph Hellwig <hch@lst.de>
> CC: Johannes Thumshirn <jthumshirn@suse.de>
> CC: Zhaohongjiang <zhaohongjiang@huawei.com>
> CC: Miao Xie <miaoxie@huawei.com>
> ---
>  drivers/scsi/scsi_sysfs.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 50e7d7e..d398894 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1398,6 +1398,15 @@ void scsi_remove_device(struct scsi_device *sdev)
>  }
>  EXPORT_SYMBOL(scsi_remove_device);
>  
> +static int scsi_device_get_not_deleted(struct scsi_device *sdev)
> +{
> +	if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
> +		return -ENXIO;
> +	if (!get_device(&sdev->sdev_gendev))
> +		return -ENXIO;
> +	return 0;
> +}
> +
>  static void __scsi_remove_target(struct scsi_target *starget)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> @@ -1415,7 +1424,7 @@ static void __scsi_remove_target(struct scsi_target *starget)
>  		 */
>  		if (sdev->channel != starget->channel ||
>  		    sdev->id != starget->id ||
> -		    !get_device(&sdev->sdev_gendev))
> +		    scsi_device_get_not_deleted(sdev))
>  			continue;
>  		spin_unlock_irqrestore(shost->host_lock, flags);
>  		scsi_remove_device(sdev);

Hi Greg,

As the above patch description shows it can happen that the SCSI core calls
get_device() after the device reference count has reached zero and before
the memory for struct device is freed. Although the above patch looks fine
to me, would you consider it acceptable to modify get_device() such that it
uses kobject_get_unless_zero() instead of kobject_get()? I'm asking this
because that change would help to reduce the complexity of the already too
complicated SCSI core.

Thanks,

Bart.
Christoph Hellwig Nov. 29, 2017, 4:20 p.m. UTC | #3
On Wed, Nov 29, 2017 at 04:18:30PM +0000, Bart Van Assche wrote:
> As the above patch description shows it can happen that the SCSI core calls
> get_device() after the device reference count has reached zero and before
> the memory for struct device is freed. Although the above patch looks fine
> to me, would you consider it acceptable to modify get_device() such that it
> uses kobject_get_unless_zero() instead of kobject_get()? I'm asking this
> because that change would help to reduce the complexity of the already too
> complicated SCSI core.

I don't think we can just modify get_device, but we can add a new
get_device_unless_zero.  In fact I have an open coded variant of that
in nvme, and was planning to submit one for the current merge window..
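
The "get unless zero" semantics proposed here can be sketched in userspace with
C11 atomics. This is only an illustration of the technique, not the kernel
implementation and not the open-coded nvme variant mentioned above: struct
model_obj, model_ref_get_unless_zero() and model_obj_get_unless_zero() are
hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a struct device. */
struct model_obj {
	atomic_int refcount;
};

/* Increment the count only if it is currently non-zero. */
static bool model_ref_get_unless_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure the current value is reloaded into old. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;	/* the object is already on its way to release */
}

/* Returns obj with an extra reference, or NULL if it must not be used. */
static struct model_obj *model_obj_get_unless_zero(struct model_obj *obj)
{
	if (obj && model_ref_get_unless_zero(&obj->refcount))
		return obj;
	return NULL;
}
```

A caller walking a list would treat a NULL return like a failed lookup and skip
the entry, instead of resurrecting an object whose release has already started.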
James Bottomley Nov. 29, 2017, 4:31 p.m. UTC | #4
On Wed, 2017-11-29 at 11:05 +0800, Jason Yan wrote:
> In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
> removed scsi_device_get() and directly called get_device() to increase
> the refcount of the device. But actullay scsi_device_get() will fail in
> three cases:
> 1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
> 2. get_device() fail
> 3. the module is not alive
> 
> The intended purpose was to remove the check of the module alive.
> Unfortunately the check of the device state was droped too. And this
> introduced a race condition like this:
> 
>       CPU0                                           CPU1
> __scsi_remove_target()
>   ->iterate shost->__devices
>   ->scsi_remove_device()
>   ->put_device()
>       someone still hold a refcount
>                                                    sd_release()
>                                                       ->scsi_disk_put()
>                                                       ->put_device() last put and trigger the device release
> 
>   ->goto restart
>   ->iterate shost->__devices and got the same device
>   ->get_device() while refcount is 0

This analysis fails here: get_device() on something with refcount 0
returns NULL.  That triggers the if clause to ignore this device.

We may have a more complex way of triggering a dual put race as the
trace implies, but I don't think this is it.

[...]
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 50e7d7e..d398894 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1398,6 +1398,15 @@ void scsi_remove_device(struct scsi_device *sdev)
>  }
>  EXPORT_SYMBOL(scsi_remove_device);
>  
> +static int scsi_device_get_not_deleted(struct scsi_device *sdev)
> +{
> +	if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
> +		return -ENXIO;
> +	if (!get_device(&sdev->sdev_gendev))
> +		return -ENXIO;
> +	return 0;
> +}

This is pretty much scsi_device_get() without the try_module get, so
they should probably be combined.

James

>  static void __scsi_remove_target(struct scsi_target *starget)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> @@ -1415,7 +1424,7 @@ static void __scsi_remove_target(struct scsi_target *starget)
>  		 */
>  		if (sdev->channel != starget->channel ||
>  		    sdev->id != starget->id ||
> -		    !get_device(&sdev->sdev_gendev))
> +		    scsi_device_get_not_deleted(sdev))
>  			continue;
>  		spin_unlock_irqrestore(shost->host_lock, flags);
>  		scsi_remove_device(sdev);
Christoph Hellwig Nov. 29, 2017, 4:34 p.m. UTC | #5
On Wed, Nov 29, 2017 at 08:31:48AM -0800, James Bottomley wrote:
> This analysis fails here: get_device() on something with refcount 0
> returns NULL.  That triggers the if clause to ignore this device.

No, it doesn't.  Take a look at the get_device and kobject_get
implementations.
James Bottomley Nov. 29, 2017, 4:47 p.m. UTC | #6
On Wed, 2017-11-29 at 17:34 +0100, Christoph Hellwig wrote:
> On Wed, Nov 29, 2017 at 08:31:48AM -0800, James Bottomley wrote:
> > 
> > This analysis fails here: get_device() on something with refcount 0
> > returns NULL.  That triggers the if clause to ignore this device.
> 
> No, it doesn't.  Take a look at the get_device and kobject_get
> implementations,

Hm, so why doesn't get_device use kref_get_unless_zero()?

James
Greg KH Nov. 29, 2017, 5:39 p.m. UTC | #7
On Wed, Nov 29, 2017 at 04:18:30PM +0000, Bart Van Assche wrote:
> On Wed, 2017-11-29 at 11:05 +0800, Jason Yan wrote:
> > In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
> > removed scsi_device_get() and directly called get_device() to increase
> > the refcount of the device. But actullay scsi_device_get() will fail in
> > three cases:
> > 1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
> > 2. get_device() fail
> > 3. the module is not alive
> > 
> > The intended purpose was to remove the check of the module alive.
> > Unfortunately the check of the device state was droped too. And this
> > introduced a race condition like this:
> > 
> >       CPU0                                           CPU1
> > __scsi_remove_target()
> >   ->iterate shost->__devices
> >   ->scsi_remove_device()
> >   ->put_device()
> >       someone still hold a refcount
> >                                                    sd_release()
> >                                                       ->scsi_disk_put()
> >                                                       ->put_device() last put and trigger the device release
> > 
> >   ->goto restart
> >   ->iterate shost->__devices and got the same device
> >   ->get_device() while refcount is 0
> >   ->scsi_remove_device()
> >   ->put_device() refcount decreased to 0 again
> >   ->scsi_device_dev_release()
> >   ->scsi_device_dev_release_usercontext()
> > 
> >                                                       ->scsi_device_dev_release()
> >                                                       ->scsi_device_dev_release_usercontext()
> > 
> > The same scsi device will be found agian because it is in the shost->__devices
> > list until scsi_device_dev_release_usercontext() called, although the device
> > state was set to SDEV_DEL after the first scsi_remove_device().
> > 
> > Finally we got a oops in scsi_device_dev_release_usercontext() when the second
> > time be called.
> > 
> > Call trace:
> > [<ffff0000086bc624>] scsi_device_dev_release_usercontext+0x7c/0x1c0
> > [<ffff0000080f1f90>] execute_in_process_context+0x70/0x80
> > [<ffff0000086bc598>] scsi_device_dev_release+0x28/0x38
> > [<ffff0000086662cc>] device_release+0x3c/0xa0
> > [<ffff000008c2e780>] kobject_put+0x80/0xf0
> > [<ffff0000086666fc>] put_device+0x24/0x30
> > [<ffff0000086aeee0>] scsi_device_put+0x30/0x40
> > [<ffff000008704894>] scsi_disk_put+0x44/0x60
> > [<ffff000008704a50>] sd_release+0x50/0x80
> > [<ffff0000082bc704>] __blkdev_put+0x21c/0x230
> > [<ffff0000082bcb2c>] blkdev_put+0x54/0x118
> > [<ffff0000082bcc1c>] blkdev_close+0x2c/0x40
> > [<ffff000008279b64>] __fput+0x94/0x1d8
> > [<ffff000008279d20>] ____fput+0x20/0x30
> > [<ffff0000080f6f54>] task_work_run+0x9c/0xb8
> > [<ffff0000080dba64>] do_exit+0x2b4/0x9f8
> > [<ffff0000080dc234>] do_group_exit+0x3c/0xa0
> > [<ffff0000080dc2b8>] __wake_up_parent+0x0/0x40
> > 
> > And sometimes in __scsi_remove_target() it will loop for a long time
> > removing the same device if someone else holding a refcount until the
> > last refcount is released.
> > 
> > Notice that if CONFIG_REFCOUNT_FULL is open this race won't be triggered
> > because the full refcount implement will prevent the refcount increase
> > when it is 0.
> > 
> > Fix this by checking the sdev_state again like we did before in
> > scsi_device_get(). Then when iterating shost again we will skip the device
> > deleted because scsi_remove_device() will set the device state to
> > SDEV_CANCEL or SDEV_DEL.
> > 
> > Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
> > Signed-off-by: Jason Yan <yanaijie@huawei.com>
> > CC: Hannes Reinecke <hare@suse.de>
> > CC: Christoph Hellwig <hch@lst.de>
> > CC: Johannes Thumshirn <jthumshirn@suse.de>
> > CC: Zhaohongjiang <zhaohongjiang@huawei.com>
> > CC: Miao Xie <miaoxie@huawei.com>
> > ---
> >  drivers/scsi/scsi_sysfs.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> > index 50e7d7e..d398894 100644
> > --- a/drivers/scsi/scsi_sysfs.c
> > +++ b/drivers/scsi/scsi_sysfs.c
> > @@ -1398,6 +1398,15 @@ void scsi_remove_device(struct scsi_device *sdev)
> >  }
> >  EXPORT_SYMBOL(scsi_remove_device);
> >  
> > +static int scsi_device_get_not_deleted(struct scsi_device *sdev)
> > +{
> > +	if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
> > +		return -ENXIO;
> > +	if (!get_device(&sdev->sdev_gendev))
> > +		return -ENXIO;
> > +	return 0;
> > +}
> > +
> >  static void __scsi_remove_target(struct scsi_target *starget)
> >  {
> >  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> > @@ -1415,7 +1424,7 @@ static void __scsi_remove_target(struct scsi_target *starget)
> >  		 */
> >  		if (sdev->channel != starget->channel ||
> >  		    sdev->id != starget->id ||
> > -		    !get_device(&sdev->sdev_gendev))
> > +		    scsi_device_get_not_deleted(sdev))
> >  			continue;
> >  		spin_unlock_irqrestore(shost->host_lock, flags);
> >  		scsi_remove_device(sdev);
> 
> Hi Greg,
> 
> As the above patch description shows it can happen that the SCSI core calls
> get_device() after the device reference count has reached zero and before
> the memory for struct device is freed. Although the above patch looks fine
> to me, would you consider it acceptable to modify get_device() such that it
> uses kobject_get_unless_zero() instead of kobject_get()? I'm asking this
> because that change would help to reduce the complexity of the already too
> complicated SCSI core.

Shouldn't there be a bus lock somewhere preventing this race?  Having an
open-coded put call isn't good, as you see here.

thanks,

greg k-h
Greg KH Nov. 29, 2017, 5:39 p.m. UTC | #8
On Wed, Nov 29, 2017 at 05:20:50PM +0100, hch@lst.de wrote:
> On Wed, Nov 29, 2017 at 04:18:30PM +0000, Bart Van Assche wrote:
> > As the above patch description shows it can happen that the SCSI core calls
> > get_device() after the device reference count has reached zero and before
> > the memory for struct device is freed. Although the above patch looks fine
> > to me, would you consider it acceptable to modify get_device() such that it
> > uses kobject_get_unless_zero() instead of kobject_get()? I'm asking this
> > because that change would help to reduce the complexity of the already too
> > complicated SCSI core.
> 
> I don't think we can just modify get_device, but we can add a new
> get_device_unless_zero.  In fact I have an open coded variant of that
> in nvme, and was planning to submit one for the current merge window..

I feel like that is just delaying the real fix, shouldn't there be a bus
lock somewhere on the put_device path for this bus to prevent this?

thanks,

greg k-h
Bart Van Assche Nov. 29, 2017, 5:47 p.m. UTC | #9
On Wed, 2017-11-29 at 17:39 +0000, gregkh@linuxfoundation.org wrote:
> On Wed, Nov 29, 2017 at 04:18:30PM +0000, Bart Van Assche wrote:
> > As the above patch description shows it can happen that the SCSI core calls
> > get_device() after the device reference count has reached zero and before
> > the memory for struct device is freed. Although the above patch looks fine
> > to me, would you consider it acceptable to modify get_device() such that it
> > uses kobject_get_unless_zero() instead of kobject_get()? I'm asking this
> > because that change would help to reduce the complexity of the already too
> > complicated SCSI core.
> 
> Shouldn't there be a bus lock somewhere preventing this race?  Having an
> open-coded put call isn't good, as you see here.

Hello Greg,

The get_device() call occurs with the SCSI host lock held. The SCSI host lock
serializes iteration over the sibling list by the get_device() caller against
removal of the SCSI device from the SCSI host's sibling list by
scsi_device_dev_release_usercontext(). If you have a look at __scsi_remove_target()
then you will see that the host lock has to be released after a matching SCSI
target has been found and before scsi_remove_device() is called, because the
latter function may sleep.
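
The locking pattern described here (search under the lock, drop the lock before
the sleeping call, then restart the search) can be sketched in userspace as
follows. This is a simplified model under stated assumptions: the model_* names
are hypothetical, the host lock is reduced to a flag, and deleted devices are
skipped during the search so that the restart loop terminates.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NDEV 3

struct model_sdev {
	int id;
	bool deleted;
};

struct model_host {
	bool lock_held;			/* stands in for the host lock */
	struct model_sdev devs[MODEL_NDEV];
	int sleeps_with_lock_held;	/* would indicate a locking bug */
};

/* Stand-in for a removal function that may sleep. */
static void model_remove_device(struct model_host *h, struct model_sdev *s)
{
	if (h->lock_held)
		h->sleeps_with_lock_held++;
	s->deleted = true;
}

/* Remove every device matching target_id; returns how many were removed. */
static int model_remove_target(struct model_host *h, int target_id)
{
	int removed = 0;
	size_t i;

restart:
	h->lock_held = true;
	for (i = 0; i < MODEL_NDEV; i++) {
		struct model_sdev *s = &h->devs[i];

		if (s->id != target_id || s->deleted)
			continue;
		h->lock_held = false;	/* drop the lock before sleeping */
		model_remove_device(h, s);
		removed++;
		goto restart;		/* the list may have changed meanwhile */
	}
	h->lock_held = false;
	return removed;
}
```

Because the lock is dropped around the removal, the list has to be searched
again from the start, which is why a device that is already being deleted must
be recognisable on the second pass.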

Bart.
Ewan Milne Nov. 29, 2017, 6:49 p.m. UTC | #10
On Wed, 2017-11-29 at 17:39 +0000, gregkh@linuxfoundation.org wrote:
> On Wed, Nov 29, 2017 at 05:20:50PM +0100, hch@lst.de wrote:
> > On Wed, Nov 29, 2017 at 04:18:30PM +0000, Bart Van Assche wrote:
> > > As the above patch description shows it can happen that the SCSI core calls
> > > get_device() after the device reference count has reached zero and before
> > > the memory for struct device is freed. Although the above patch looks fine
> > > to me, would you consider it acceptable to modify get_device() such that it
> > > uses kobject_get_unless_zero() instead of kobject_get()? I'm asking this
> > > because that change would help to reduce the complexity of the already too
> > > complicated SCSI core.
> > 
> > I don't think we can just modify get_device, but we can add a new
> > get_device_unless_zero.  In fact I have an open coded variant of that
> > in nvme, and was planning to submit one for the current merge window..
> 
> I feel like that is just delaying the real fix, shouldn't there be a bus
> lock somewhere on the put_device path for this bus to prevent this?
> 
> thanks,
> 
> greg k-h

Why is it that clients of the kobject code have to have their own
lock / state checking to prevent a duplicate destructor callback?
It seems to me like this is something the core functionality should
provide, because a get inside a destructor would *always* be wrong, no?

It looks like:

void refcount_inc(refcount_t *r)
{
        WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
}

would have warned if CONFIG_REFCOUNT_FULL was on, I/we don't normally
enable that though.

-Ewan
Ewan Milne Nov. 29, 2017, 7:05 p.m. UTC | #11
On Wed, 2017-11-29 at 11:05 +0800, Jason Yan wrote:
> In commit fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()"), we
> removed scsi_device_get() and directly called get_device() to increase
> the refcount of the device. But actullay scsi_device_get() will fail in
> three cases:
> 1. the scsi device is in SDEV_DEL or SDEV_CANCEL state
> 2. get_device() fail
> 3. the module is not alive
> 
> The intended purpose was to remove the check of the module alive.
> Unfortunately the check of the device state was droped too. And this
> introduced a race condition like this:
> 
>       CPU0                                           CPU1
> __scsi_remove_target()
>   ->iterate shost->__devices
>   ->scsi_remove_device()
>   ->put_device()
>       someone still hold a refcount
>                                                    sd_release()
>                                                       ->scsi_disk_put()
>                                                       ->put_device() last put and trigger the device release
> 
>   ->goto restart
>   ->iterate shost->__devices and got the same device
>   ->get_device() while refcount is 0
>   ->scsi_remove_device()
>   ->put_device() refcount decreased to 0 again
>   ->scsi_device_dev_release()
>   ->scsi_device_dev_release_usercontext()
> 
>                                                       ->scsi_device_dev_release()
>                                                       ->scsi_device_dev_release_usercontext()
> 
> The same scsi device will be found agian because it is in the shost->__devices
> list until scsi_device_dev_release_usercontext() called, although the device
> state was set to SDEV_DEL after the first scsi_remove_device().
> 
> Finally we got a oops in scsi_device_dev_release_usercontext() when the second
> time be called.
> 
> Call trace:
> [<ffff0000086bc624>] scsi_device_dev_release_usercontext+0x7c/0x1c0
> [<ffff0000080f1f90>] execute_in_process_context+0x70/0x80
> [<ffff0000086bc598>] scsi_device_dev_release+0x28/0x38
> [<ffff0000086662cc>] device_release+0x3c/0xa0
> [<ffff000008c2e780>] kobject_put+0x80/0xf0
> [<ffff0000086666fc>] put_device+0x24/0x30
> [<ffff0000086aeee0>] scsi_device_put+0x30/0x40
> [<ffff000008704894>] scsi_disk_put+0x44/0x60
> [<ffff000008704a50>] sd_release+0x50/0x80
> [<ffff0000082bc704>] __blkdev_put+0x21c/0x230
> [<ffff0000082bcb2c>] blkdev_put+0x54/0x118
> [<ffff0000082bcc1c>] blkdev_close+0x2c/0x40
> [<ffff000008279b64>] __fput+0x94/0x1d8
> [<ffff000008279d20>] ____fput+0x20/0x30
> [<ffff0000080f6f54>] task_work_run+0x9c/0xb8
> [<ffff0000080dba64>] do_exit+0x2b4/0x9f8
> [<ffff0000080dc234>] do_group_exit+0x3c/0xa0
> [<ffff0000080dc2b8>] __wake_up_parent+0x0/0x40
> 
> And sometimes in __scsi_remove_target() it will loop for a long time
> removing the same device if someone else holding a refcount until the
> last refcount is released.
> 
> Notice that if CONFIG_REFCOUNT_FULL is open this race won't be triggered
> because the full refcount implement will prevent the refcount increase
> when it is 0.
> 
> Fix this by checking the sdev_state again like we did before in
> scsi_device_get(). Then when iterating shost again we will skip the device
> deleted because scsi_remove_device() will set the device state to
> SDEV_CANCEL or SDEV_DEL.
> 
> Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
> Signed-off-by: Jason Yan <yanaijie@huawei.com>
> CC: Hannes Reinecke <hare@suse.de>
> CC: Christoph Hellwig <hch@lst.de>
> CC: Johannes Thumshirn <jthumshirn@suse.de>
> CC: Zhaohongjiang <zhaohongjiang@huawei.com>
> CC: Miao Xie <miaoxie@huawei.com>
> ---
>  drivers/scsi/scsi_sysfs.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 50e7d7e..d398894 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1398,6 +1398,15 @@ void scsi_remove_device(struct scsi_device *sdev)
>  }
>  EXPORT_SYMBOL(scsi_remove_device);
>  
> +static int scsi_device_get_not_deleted(struct scsi_device *sdev)
> +{
> +	if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
> +		return -ENXIO;
> +	if (!get_device(&sdev->sdev_gendev))
> +		return -ENXIO;
> +	return 0;
> +}
> +
>  static void __scsi_remove_target(struct scsi_target *starget)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> @@ -1415,7 +1424,7 @@ static void __scsi_remove_target(struct scsi_target *starget)
>  		 */
>  		if (sdev->channel != starget->channel ||
>  		    sdev->id != starget->id ||
> -		    !get_device(&sdev->sdev_gendev))
> +		    scsi_device_get_not_deleted(sdev))
>  			continue;
>  		spin_unlock_irqrestore(shost->host_lock, flags);
>  		scsi_remove_device(sdev);

See the subsequent discussion; however, we have a reproducible case here
and the patch does appear to fix the issue (500+ iterations).

Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Bart Van Assche Nov. 29, 2017, 7:11 p.m. UTC | #12
On Wed, 2017-11-29 at 13:49 -0500, Ewan D. Milne wrote:
> because a get inside a destructor would *always* be wrong, no?


Hello Ewan,

That's not what we are discussing. What can happen with the SCSI core is that
get_device() is called concurrently with the destructor. get_device() can be
called concurrently with the destructor because the destructor removes a
device from the siblings list and because the SCSI core can call get_device()
for devices it finds on the siblings list. Personally I think that design is
superior compared to removing a SCSI device from the sibling list before the
last put_device() call because the approach followed in the SCSI core leads to
a simpler implementation. However, it seems like the current get_device()
implementation does not yet support the SCSI core design ...

Bart.
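The interleaving described above can be replayed deterministically in user space. The following is only a single-threaded sketch under stated assumptions: `struct model_dev` and every `model_*` function are hypothetical stand-ins, not SCSI core or driver core code, and the counter is a plain integer that, like the non-saturating refcount discussed in this thread, may be incremented from zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a refcounted device. */
struct model_dev {
	int refcount;       /* plain counter that can be raised from zero */
	int release_calls;  /* how many times the destructor was entered */
	bool on_list;       /* still reachable by list iteration */
};

/* Naive get: increments unconditionally, even from zero. */
static void model_get(struct model_dev *d)
{
	d->refcount++;
}

/* Put: enters the destructor whenever the count reaches zero. */
static void model_put(struct model_dev *d)
{
	assert(d->refcount > 0);
	if (--d->refcount == 0)
		d->release_calls++; /* destructor entered, unlink is deferred */
}

/* Deferred tail of the destructor: only here does the object leave the list. */
static void model_release_finish(struct model_dev *d)
{
	d->on_list = false;
}

/* Replays the interleaving and returns the number of destructor entries. */
static int model_replay_race(void)
{
	struct model_dev d = { .refcount = 1, .release_calls = 0, .on_list = true };

	model_put(&d);            /* CPU1: last put, destructor starts */
	if (d.on_list) {          /* CPU0: iteration still finds the device */
		model_get(&d);    /* 0 -> 1, the invalid resurrection */
		model_put(&d);    /* 1 -> 0, destructor entered a second time */
	}
	model_release_finish(&d); /* CPU1: the unlink happens too late */
	return d.release_calls;
}
```

In this model the destructor is entered twice for the same object, which corresponds to the double call of scsi_device_dev_release_usercontext() in the original report.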
Ewan Milne Nov. 29, 2017, 7:20 p.m. UTC | #13
On Wed, 2017-11-29 at 19:11 +0000, Bart Van Assche wrote:
> On Wed, 2017-11-29 at 13:49 -0500, Ewan D. Milne wrote:
> > because a get inside a destructor would *always* be wrong, no?
> 
> Hello Ewan,
> 
> That's not what we are discussing. What can happen with the SCSI core is that
> get_device() is called concurrently with the destructor. get_device() can be
> called concurrently with the destructor because the destructor removes a
> device from the siblings list and because the SCSI core can call get_device()
> for devices it finds on the siblings list. Personally I think that design is
> superior compared to removing a SCSI device from the sibling list before the
> last put_device() call because the approach followed in the SCSI core leads to
> a simpler implementation. However, it seems like the current get_device()
> implementation does not yet support the SCSI core design ...
> 
> Bart.

OK, well, I think the point still stands, though, once the refcount
goes to zero and the destructor is invoked, a get that then increments
the refcount seems fundamentally wrong to me.  Especially if a
subsequent put causes the destructor to be invoked *simultaneously*
*on another thread*.  The locking has to happen somewhere, why isn't
this done by the kobject?

Relying on the client code to get this right means that there are
opportunities all over the kernel for problems like this to happen,
just like here, where we inadvertently removed the state check that
prevented the get_device() call.

-Ewan
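The state check referred to here can be modelled in a few lines of user-space C. This is a sketch only: `struct model_sdev`, `enum model_state` and `model_sdev_get_not_deleted()` are hypothetical names that mirror the shape of the helper in the patch, the counter is a plain integer, and the model is single-threaded, so it says nothing about which lock the real check would need to hold.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device states, loosely modelled on the SDEV_* values. */
enum model_state { MODEL_RUNNING, MODEL_CANCEL, MODEL_DEL };

struct model_sdev {
	enum model_state state;
	int refcount;
};

/*
 * Take a reference only if the device is not being removed.
 * Returns 0 on success and -ENXIO if the device must be skipped.
 */
static int model_sdev_get_not_deleted(struct model_sdev *s)
{
	assert(s != NULL);
	if (s->state == MODEL_DEL || s->state == MODEL_CANCEL)
		return -ENXIO;
	s->refcount++;
	return 0;
}
```

A caller iterating a list would treat a non-zero return as "skip this entry", which is how the patch uses its helper in __scsi_remove_target().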
Bart Van Assche Nov. 29, 2017, 7:50 p.m. UTC | #14
On Wed, 2017-11-29 at 14:20 -0500, Ewan D. Milne wrote:
> OK, well, I think the point still stands, though, once the refcount
> goes to zero and the destructor is invoked, a get that then increments
> the refcount seems fundamentally wrong to me.

I agree that incrementing a reference count that has dropped to zero is wrong.
However, that's what happens currently. That behavior has been reported as a
bug. We need to fix this behavior, either through the patch at the start of
this thread or by using code that avoids incrementing a zero reference count,
e.g. kobject_get_unless_zero().

Bart.
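The "get unless zero" idea can be illustrated with C11 atomics in user space. The sketch below is not the kernel's kobject_get_unless_zero() implementation. `struct model_ref`, `model_ref_get_unless_zero()` and `model_ref_put()` are hypothetical names chosen for this example, and the only property shown is that a count which has reached zero can never be raised again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter for illustration. */
struct model_ref {
	atomic_int count;
};

/* Take a reference only if the count is non-zero; true on success. */
static bool model_ref_get_unless_zero(struct model_ref *r)
{
	int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false; /* the object is already being destroyed */
}

/* Drop a reference; true if this was the last one. */
static bool model_ref_put(struct model_ref *r)
{
	int old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);
	return old == 1;
}
```

With such a primitive, a list walker that finds an object whose destructor has already started simply fails to get a reference and moves on, without needing a separate state check.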

Patch

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 50e7d7e..d398894 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1398,6 +1398,15 @@  void scsi_remove_device(struct scsi_device *sdev)
 }
 EXPORT_SYMBOL(scsi_remove_device);
 
+static int scsi_device_get_not_deleted(struct scsi_device *sdev)
+{
+	if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
+		return -ENXIO;
+	if (!get_device(&sdev->sdev_gendev))
+		return -ENXIO;
+	return 0;
+}
+
 static void __scsi_remove_target(struct scsi_target *starget)
 {
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
@@ -1415,7 +1424,7 @@  static void __scsi_remove_target(struct scsi_target *starget)
 		 */
 		if (sdev->channel != starget->channel ||
 		    sdev->id != starget->id ||
-		    !get_device(&sdev->sdev_gendev))
+		    scsi_device_get_not_deleted(sdev))
 			continue;
 		spin_unlock_irqrestore(shost->host_lock, flags);
 		scsi_remove_device(sdev);