diff mbox

enclosure: fix sysfs symlinks creation when using multipath

Message ID 1998341d-7f47-f73a-c5c5-4095b7fa9ef1@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Douglas Miller June 16, 2017, 3:41 p.m. UTC
On 03/16/2017 01:49 PM, James Bottomley wrote:
> On Wed, 2017-03-15 at 19:39 -0400, Martin K. Petersen wrote:
>> Maurizio Lombardi <mlombard@redhat.com> writes:
>>
>>> With multipath, it may happen that the same device is passed to
>>> enclosure_add_device() multiple times and that the
>>> enclosure_add_links() function fails to create the symlinks because
>>> the device's sysfs directory entry is still NULL.  In this case,
>>> the
>>> links will never be created because all the subsequent calls to
>>> enclosure_add_device() will immediately fail with EEXIST.
>> James?
> Well I don't think the patch is the correct way to do this.  The
> problem is that if we encounter an error creating the links, we
> shouldn't add the device to the enclosure.  There's no need of a
> links_created variable (see below).
>
> However, more interesting is why the link creation failed in the first
> place.  The device clearly seems to exist because it was added to sysfs
> at time index 19.2 and the enclosure didn't try to use it until 60.0.
>   Can you debug this a bit more, please?  I can't see anything specific
> to multipath in the trace, so whatever this is looks like it could
> happen in the single path case as well.
>
> James
>
> diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
> index 65fed71..ae89082 100644
> --- a/drivers/misc/enclosure.c
> +++ b/drivers/misc/enclosure.c
> @@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
>   			 struct device *dev)
>   {
>   	struct enclosure_component *cdev;
> +	int err;
>   
>   	if (!edev || component >= edev->components)
>   		return -EINVAL;
> @@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
>   	if (cdev->dev == dev)
>   		return -EEXIST;
>   
> -	if (cdev->dev)
> +	if (cdev->dev) {
>   		enclosure_remove_links(cdev);
> -
> -	put_device(cdev->dev);
> -	cdev->dev = get_device(dev);
> -	return enclosure_add_links(cdev);
> +		put_device(cdev->dev);
> +		cdev->dev = NULL;
> +	}
> +	err = enclosure_add_links(cdev);
> +	if (!err)
> +		cdev->dev = get_device(dev);
> +	return err;
>   }
>   EXPORT_SYMBOL_GPL(enclosure_add_device);
>   
After stumbling across the NULL pointer panic, I was able to use 
Maurizio's second patch below:

                 return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device 
*edev, int component,
         if (cdev->dev == dev)
                 return -EEXIST;

-       if (cdev->dev)
+       if (cdev->dev) {
                 enclosure_remove_links(cdev);
-
-       put_device(cdev->dev);
+               put_device(cdev->dev);
+       }
         cdev->dev = get_device(dev);
-       return enclosure_add_links(cdev);
+       err = enclosure_add_links(cdev);
+       if (err) {
+               cdev->dev = NULL;
+               put_device(cdev->dev);
+       }
+       return err;
  }
  EXPORT_SYMBOL_GPL(enclosure_add_device);


I am able to pass my testing with this patch. I don't see an official 
submit of this patch, but will respond to it when I see one. Again, I am 
seeing the problem even without multipath.

Comments

Douglas Miller June 16, 2017, 4:08 p.m. UTC | #1
On 06/16/2017 10:41 AM, Douglas Miller wrote:
> On 03/16/2017 01:49 PM, James Bottomley wrote:
>> On Wed, 2017-03-15 at 19:39 -0400, Martin K. Petersen wrote:
>>> Maurizio Lombardi <mlombard@redhat.com> writes:
>>>
>>>> With multipath, it may happen that the same device is passed to
>>>> enclosure_add_device() multiple times and that the
>>>> enclosure_add_links() function fails to create the symlinks because
>>>> the device's sysfs directory entry is still NULL.  In this case,
>>>> the
>>>> links will never be created because all the subsequent calls to
>>>> enclosure_add_device() will immediately fail with EEXIST.
>>> James?
>> Well I don't think the patch is the correct way to do this.  The
>> problem is that if we encounter an error creating the links, we
>> shouldn't add the device to the enclosure.  There's no need of a
>> links_created variable (see below).
>>
>> However, more interesting is why the link creation failed in the first
>> place.  The device clearly seems to exist because it was added to sysfs
>> at time index 19.2 and the enclosure didn't try to use it until 60.0.
>>   Can you debug this a bit more, please?  I can't see anything specific
>> to multipath in the trace, so whatever this is looks like it could
>> happen in the single path case as well.
>>
>> James
>>
>> diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
>> index 65fed71..ae89082 100644
>> --- a/drivers/misc/enclosure.c
>> +++ b/drivers/misc/enclosure.c
>> @@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device 
>> *edev, int component,
>>                struct device *dev)
>>   {
>>       struct enclosure_component *cdev;
>> +    int err;
>>         if (!edev || component >= edev->components)
>>           return -EINVAL;
>> @@ -384,12 +385,15 @@ int enclosure_add_device(struct 
>> enclosure_device *edev, int component,
>>       if (cdev->dev == dev)
>>           return -EEXIST;
>>   -    if (cdev->dev)
>> +    if (cdev->dev) {
>>           enclosure_remove_links(cdev);
>> -
>> -    put_device(cdev->dev);
>> -    cdev->dev = get_device(dev);
>> -    return enclosure_add_links(cdev);
>> +        put_device(cdev->dev);
>> +        cdev->dev = NULL;
>> +    }
>> +    err = enclosure_add_links(cdev);
>> +    if (!err)
>> +        cdev->dev = get_device(dev);
>> +    return err;
>>   }
>>   EXPORT_SYMBOL_GPL(enclosure_add_device);
> After stumbling across the NULL pointer panic, I was able to use 
> Maurizio's second patch below:
>
> diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
> index 65fed71..6ac07ea 100644
> --- a/drivers/misc/enclosure.c
> +++ b/drivers/misc/enclosure.c
> @@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device 
> *edev, int component,
>                          struct device *dev)
>  {
>         struct enclosure_component *cdev;
> +       int err;
>
>         if (!edev || component >= edev->components)
>                 return -EINVAL;
> @@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device 
> *edev, int component,
>         if (cdev->dev == dev)
>                 return -EEXIST;
>
> -       if (cdev->dev)
> +       if (cdev->dev) {
>                 enclosure_remove_links(cdev);
> -
> -       put_device(cdev->dev);
> +               put_device(cdev->dev);
> +       }
>         cdev->dev = get_device(dev);
> -       return enclosure_add_links(cdev);
> +       err = enclosure_add_links(cdev);
> +       if (err) {
> +               cdev->dev = NULL;
> +               put_device(cdev->dev);
> +       }
> +       return err;
>  }
>  EXPORT_SYMBOL_GPL(enclosure_add_device);
>
>
> I am able to pass my testing with this patch. I don't see an official 
> submit of this patch, but will respond to it when I see one. Again, I 
> am seeing the problem even without multipath.
>
Just to respond to James' question on the cause. What I observed was a 
race condition between udevd (ses_init()) and a worker thread 
(do_scan_async()), where the worker thread is creating the directories 
that are the target of the symlinks being created by udevd. Something 
was happening when udevd caught up with the worker thread (so the target 
directory did not exist) and it seemed the worker thread either got 
preempted or else just could not stay ahead of udevd. This means that 
udevd started failing to create symlinks even though the worker thread 
eventually got them all created. I did observe what appeared to be 
preemption, as the creation of directories stopped until udevd finished 
failing all the (rest of the) symlinks. Although there may have been 
other explanations for what I saw.
Maurizio Lombardi June 20, 2017, 11:38 a.m. UTC | #2
Dne 16.6.2017 v 18:08 Douglas Miller napsal(a):
> Just to respond to James' question on the cause. What I observed was a race condition between udevd (ses_init()) and a worker thread (do_scan_async()),
> where the worker thread is creating the directories that are the target of the symlinks being created by udevd.
> Something was happening when udevd caught up with the worker thread (so the target directory did not exist) and it seemed the worker thread either got preempted or
> else just could not stay ahead of udevd. This means that udevd started failing to create symlinks even though the worker thread eventually got them all created.
> I did observe what appeared to be preemption, as the creation of directories stopped until udevd finished failing all the (rest of the) symlinks.
> Although there may have been other explanations for what I saw.
> 

> I am able to pass my testing with this patch. I don't see an official submit of this patch, but will respond to it when I see one.

Thanks Douglas for testing it, I will resubmit the patch if no one has any objections.

Maurizio.
Douglas Miller June 26, 2017, 2:22 p.m. UTC | #3
On 06/20/2017 06:38 AM, Maurizio Lombardi wrote:
>
> Dne 16.6.2017 v 18:08 Douglas Miller napsal(a):
>> Just to respond to James' question on the cause. What I observed was a race condition between udevd (ses_init()) and a worker thread (do_scan_async()),
>> where the worker thread is creating the directories that are the target of the symlinks being created by udevd.
>> Something was happening when udevd caught up with the worker thread (so the target directory did not exist) and it seemed the worker thread either got preempted or
>> else just could not stay ahead of udevd. This means that udevd started failing to create symlinks even though the worker thread eventually got them all created.
>> I did observe what appeared to be preemption, as the creation of directories stopped until udevd finished failing all the (rest of the) symlinks.
>> Although there may have been other explanations for what I saw.
>>
>> I am able to pass my testing with this patch. I don't see an official submit of this patch, but will respond to it when I see one.
> Thanks Douglas for testing it, I will resubmit the patch if no one has any objections.
>
> Maurizio.
>
I did not see any additional comments, and no objections. Is it time to 
submit the new patch?

Thanks,
Doug
diff mbox

Patch

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..6ac07ea 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@  int enclosure_add_device(struct enclosure_device 
*edev, int component,
                          struct device *dev)
  {
         struct enclosure_component *cdev;
+       int err;

         if (!edev || component >= edev->components)