Message ID | 2c17cf29fbce8fc1cfbf60cfd04559d00c8eeac0.1554756534.git.alifm@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fio-ccw fixes for kernel stacktraces | expand |
On Mon, 8 Apr 2019 17:05:32 -0400 Farhan Ali <alifm@linux.ibm.com> wrote: > The quiesce function calls cio_cancel_halt_clear() and if we > get an -EBUSY we go into a loop where we: > - wait for any interrupts > - flush all I/O in the workqueue > - retry cio_cancel_halt_clear > > During the period where we are waiting for interrupts or > flushing all I/O, the channel subsystem could have completed > a halt/clear action and turned off the corresponding activity > control bits in the subchannel status word. This means the next > time we call cio_cancel_halt_clear(), we will again start by > calling cancel subchannel and so we can be stuck between calling > cancel and halt forever. > > Rather than calling cio_cancel_halt_clear() immediately after > waiting, let's try to disable the subchannel. If we succeed in > disabling the subchannel then we know nothing else can happen > with the device. > > Suggested-by: Eric Farman <farman@linux.ibm.com> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com> > --- > drivers/s390/cio/vfio_ccw_drv.c | 27 ++++++++++++--------------- > 1 file changed, 12 insertions(+), 15 deletions(-) > > diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c > index 5aca475..4405f2a 100644 > --- a/drivers/s390/cio/vfio_ccw_drv.c > +++ b/drivers/s390/cio/vfio_ccw_drv.c > @@ -43,26 +43,23 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch) > if (ret != -EBUSY) > goto out_unlock; > > + iretry = 255; > do { > - iretry = 255; > > ret = cio_cancel_halt_clear(sch, &iretry); > - while (ret == -EBUSY) { > - /* > - * Flush all I/O and wait for > - * cancel/halt/clear completion. > - */ > - private->completion = &completion; > - spin_unlock_irq(sch->lock); > - > + /* > + * Flush all I/O and wait for > + * cancel/halt/clear completion. > + */ > + private->completion = &completion; > + spin_unlock_irq(sch->lock); > + > + if (ret == -EBUSY) I don't think you need to do the unlock/lock and change private->completion if you don't actually wait, no? Looking at the possible return codes: * -ENODEV -> device is not operational anyway, in theory you should even not need to bother with disabling the subchannel * -EIO -> we've run out of retries, and the subchannel still is not idle; I'm not sure if we could do anything here, as disable is unlikely to work, either * -EBUSY -> we expect an interrupt (or a timeout), the loop looks fine for that * 0 -> the one thing that might happen is that we get an unsolicited interrupt between the successful cancel_halt_clear and the disable; not even giving up the lock here might even be better here? I think this loop will probably work as it is after this patch, but giving up the lock when not really needed makes me a bit queasy... what do others think? > wait_for_completion_timeout(&completion, 3*HZ); > > - private->completion = NULL; > - flush_workqueue(vfio_ccw_work_q); > - spin_lock_irq(sch->lock); > - ret = cio_cancel_halt_clear(sch, &iretry); > - }; > - > + private->completion = NULL; > + flush_workqueue(vfio_ccw_work_q); > + spin_lock_irq(sch->lock); > ret = cio_disable_subchannel(sch); > } while (ret == -EBUSY); > out_unlock:
On 04/11/2019 12:24 PM, Cornelia Huck wrote: > On Mon, 8 Apr 2019 17:05:32 -0400 > Farhan Ali <alifm@linux.ibm.com> wrote: > >> The quiesce function calls cio_cancel_halt_clear() and if we >> get an -EBUSY we go into a loop where we: >> - wait for any interrupts >> - flush all I/O in the workqueue >> - retry cio_cancel_halt_clear >> >> During the period where we are waiting for interrupts or >> flushing all I/O, the channel subsystem could have completed >> a halt/clear action and turned off the corresponding activity >> control bits in the subchannel status word. This means the next >> time we call cio_cancel_halt_clear(), we will again start by >> calling cancel subchannel and so we can be stuck between calling >> cancel and halt forever. >> >> Rather than calling cio_cancel_halt_clear() immediately after >> waiting, let's try to disable the subchannel. If we succeed in >> disabling the subchannel then we know nothing else can happen >> with the device. >> >> Suggested-by: Eric Farman <farman@linux.ibm.com> >> Signed-off-by: Farhan Ali <alifm@linux.ibm.com> >> --- >> drivers/s390/cio/vfio_ccw_drv.c | 27 ++++++++++++--------------- >> 1 file changed, 12 insertions(+), 15 deletions(-) >> >> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c >> index 5aca475..4405f2a 100644 >> --- a/drivers/s390/cio/vfio_ccw_drv.c >> +++ b/drivers/s390/cio/vfio_ccw_drv.c >> @@ -43,26 +43,23 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch) >> if (ret != -EBUSY) >> goto out_unlock; >> >> + iretry = 255; >> do { >> - iretry = 255; >> >> ret = cio_cancel_halt_clear(sch, &iretry); >> - while (ret == -EBUSY) { >> - /* >> - * Flush all I/O and wait for >> - * cancel/halt/clear completion. >> - */ >> - private->completion = &completion; >> - spin_unlock_irq(sch->lock); >> - >> + /* >> + * Flush all I/O and wait for >> + * cancel/halt/clear completion. >> + */ >> + private->completion = &completion; >> + spin_unlock_irq(sch->lock); >> + >> + if (ret == -EBUSY) > > I don't think you need to do the unlock/lock and change > private->completion if you don't actually wait, no? If we don't end up waiting, then changing private->completion would not be needed. But we would still need to release the spinlock due to [1]. > > Looking at the possible return codes: > * -ENODEV -> device is not operational anyway, in theory you should even > not need to bother with disabling the subchannel > * -EIO -> we've run out of retries, and the subchannel still is not > idle; I'm not sure if we could do anything here, as disable is > unlikely to work, either We could break out of the loop early for these cases. My thinking was I wanted to depend on the result of trying to disable, because ultimately that's what we want. I can add the cases to break out of the loop early. > * -EBUSY -> we expect an interrupt (or a timeout), the loop looks fine > for that > * 0 -> the one thing that might happen is that we get an unsolicited > interrupt between the successful cancel_halt_clear and the disable; > not even giving up the lock here might even be better here? I didn't think of this case, but if cancel_halt_clear succeeds with 0 then we should wait, no? > > I think this loop will probably work as it is after this patch, but > giving up the lock when not really needed makes me a bit queasy... what > do others think? > >> wait_for_completion_timeout(&completion, 3*HZ); >> >> - private->completion = NULL; >> - flush_workqueue(vfio_ccw_work_q); >> - spin_lock_irq(sch->lock); >> - ret = cio_cancel_halt_clear(sch, &iretry); >> - }; >> - >> + private->completion = NULL; [1] flush_workqueue can go to sleep so we would still need to release spinlock and reacquire it again to try disabling the subchannel. >> + flush_workqueue(vfio_ccw_work_q); >> + spin_lock_irq(sch->lock); >> ret = cio_disable_subchannel(sch); >> } while (ret == -EBUSY); >> out_unlock: > >
On Thu, 11 Apr 2019 16:30:44 -0400 Farhan Ali <alifm@linux.ibm.com> wrote: > On 04/11/2019 12:24 PM, Cornelia Huck wrote: > > On Mon, 8 Apr 2019 17:05:32 -0400 > > Farhan Ali <alifm@linux.ibm.com> wrote: > > > >> The quiesce function calls cio_cancel_halt_clear() and if we > >> get an -EBUSY we go into a loop where we: > >> - wait for any interrupts > >> - flush all I/O in the workqueue > >> - retry cio_cancel_halt_clear > >> > >> During the period where we are waiting for interrupts or > >> flushing all I/O, the channel subsystem could have completed > >> a halt/clear action and turned off the corresponding activity > >> control bits in the subchannel status word. This means the next > >> time we call cio_cancel_halt_clear(), we will again start by > >> calling cancel subchannel and so we can be stuck between calling > >> cancel and halt forever. > >> > >> Rather than calling cio_cancel_halt_clear() immediately after > >> waiting, let's try to disable the subchannel. If we succeed in > >> disabling the subchannel then we know nothing else can happen > >> with the device. > >> > >> Suggested-by: Eric Farman <farman@linux.ibm.com> > >> Signed-off-by: Farhan Ali <alifm@linux.ibm.com> > >> --- > >> drivers/s390/cio/vfio_ccw_drv.c | 27 ++++++++++++--------------- > >> 1 file changed, 12 insertions(+), 15 deletions(-) > >> > >> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c > >> index 5aca475..4405f2a 100644 > >> --- a/drivers/s390/cio/vfio_ccw_drv.c > >> +++ b/drivers/s390/cio/vfio_ccw_drv.c > >> @@ -43,26 +43,23 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch) > >> if (ret != -EBUSY) > >> goto out_unlock; > >> > >> + iretry = 255; > >> do { > >> - iretry = 255; > >> > >> ret = cio_cancel_halt_clear(sch, &iretry); > >> - while (ret == -EBUSY) { > >> - /* > >> - * Flush all I/O and wait for > >> - * cancel/halt/clear completion. > >> - */ > >> - private->completion = &completion; > >> - spin_unlock_irq(sch->lock); > >> - > >> + /* > >> + * Flush all I/O and wait for > >> + * cancel/halt/clear completion. > >> + */ > >> + private->completion = &completion; > >> + spin_unlock_irq(sch->lock); > >> + > >> + if (ret == -EBUSY) > > > > I don't think you need to do the unlock/lock and change > > private->completion if you don't actually wait, no? > > If we don't end up waiting, then changing private->completion would not > be needed. But we would still need to release the spinlock due to [1]. > > > > > Looking at the possible return codes: > > * -ENODEV -> device is not operational anyway, in theory you should even > > not need to bother with disabling the subchannel > > * -EIO -> we've run out of retries, and the subchannel still is not > > idle; I'm not sure if we could do anything here, as disable is > > unlikely to work, either > > We could break out of the loop early for these cases. My thinking was I > wanted to depend on the result of trying to disable, because ultimately > that's what we want. > > I can add the cases to break out of the loop early. The -ENODEV case does not really hurt, as it will get us out of the loop anyway. But for the -EIO case, I think we'll get -EBUSY from the disable and stay within the loop endlessly? > > > > * -EBUSY -> we expect an interrupt (or a timeout), the loop looks fine > > for that > > * 0 -> the one thing that might happen is that we get an unsolicited > > interrupt between the successful cancel_halt_clear and the disable; > > not even giving up the lock here might even be better here? > > I didn't think of this case, but if cancel_halt_clear succeeds with 0 > then we should wait, no? For 0 I don't expect a solicited interrupt (documentation for the functions says that the subchannel is idle in that case); it's just the unsolicited interrupt that might get into the way. > > > > > I think this loop will probably work as it is after this patch, but > > giving up the lock when not really needed makes me a bit queasy... what > > do others think? > > > >> wait_for_completion_timeout(&completion, 3*HZ); > >> > >> - private->completion = NULL; > >> - flush_workqueue(vfio_ccw_work_q); > >> - spin_lock_irq(sch->lock); > >> - ret = cio_cancel_halt_clear(sch, &iretry); > >> - }; > >> - > >> + private->completion = NULL; > > [1] flush_workqueue can go to sleep so we would still need to release > spinlock and reacquire it again to try disabling the subchannel. Grr, I thought we could skip the flush in the !-EBUSY case, but I think we can't due to the possibility of an unsolicited interrupt... what simply adding handling for -EIO (although I'm not sure what we can sensibly do in that case) and leave the other cases as they are now? > > >> + flush_workqueue(vfio_ccw_work_q); > >> + spin_lock_irq(sch->lock); > >> ret = cio_disable_subchannel(sch); > >> } while (ret == -EBUSY); > >> out_unlock: > > > > >
On 04/12/2019 04:10 AM, Cornelia Huck wrote: > On Thu, 11 Apr 2019 16:30:44 -0400 > Farhan Ali <alifm@linux.ibm.com> wrote: > >> On 04/11/2019 12:24 PM, Cornelia Huck wrote: >>> On Mon, 8 Apr 2019 17:05:32 -0400 >>> Farhan Ali <alifm@linux.ibm.com> wrote: >>> >>>> The quiesce function calls cio_cancel_halt_clear() and if we >>>> get an -EBUSY we go into a loop where we: >>>> - wait for any interrupts >>>> - flush all I/O in the workqueue >>>> - retry cio_cancel_halt_clear >>>> >>>> During the period where we are waiting for interrupts or >>>> flushing all I/O, the channel subsystem could have completed >>>> a halt/clear action and turned off the corresponding activity >>>> control bits in the subchannel status word. This means the next >>>> time we call cio_cancel_halt_clear(), we will again start by >>>> calling cancel subchannel and so we can be stuck between calling >>>> cancel and halt forever. >>>> >>>> Rather than calling cio_cancel_halt_clear() immediately after >>>> waiting, let's try to disable the subchannel. If we succeed in >>>> disabling the subchannel then we know nothing else can happen >>>> with the device. >>>> >>>> Suggested-by: Eric Farman <farman@linux.ibm.com> >>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com> >>>> --- >>>> drivers/s390/cio/vfio_ccw_drv.c | 27 ++++++++++++--------------- >>>> 1 file changed, 12 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c >>>> index 5aca475..4405f2a 100644 >>>> --- a/drivers/s390/cio/vfio_ccw_drv.c >>>> +++ b/drivers/s390/cio/vfio_ccw_drv.c >>>> @@ -43,26 +43,23 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch) >>>> if (ret != -EBUSY) >>>> goto out_unlock; >>>> >>>> + iretry = 255; >>>> do { >>>> - iretry = 255; >>>> >>>> ret = cio_cancel_halt_clear(sch, &iretry); >>>> - while (ret == -EBUSY) { >>>> - /* >>>> - * Flush all I/O and wait for >>>> - * cancel/halt/clear completion. >>>> - */ >>>> - private->completion = &completion; >>>> - spin_unlock_irq(sch->lock); >>>> - >>>> + /* >>>> + * Flush all I/O and wait for >>>> + * cancel/halt/clear completion. >>>> + */ >>>> + private->completion = &completion; >>>> + spin_unlock_irq(sch->lock); >>>> + >>>> + if (ret == -EBUSY) >>> >>> I don't think you need to do the unlock/lock and change >>> private->completion if you don't actually wait, no? >> >> If we don't end up waiting, then changing private->completion would not >> be needed. But we would still need to release the spinlock due to [1]. >> >>> >>> Looking at the possible return codes: >>> * -ENODEV -> device is not operational anyway, in theory you should even >>> not need to bother with disabling the subchannel >>> * -EIO -> we've run out of retries, and the subchannel still is not >>> idle; I'm not sure if we could do anything here, as disable is >>> unlikely to work, either >> >> We could break out of the loop early for these cases. My thinking was I >> wanted to depend on the result of trying to disable, because ultimately >> that's what we want. >> >> I can add the cases to break out of the loop early. > > The -ENODEV case does not really hurt, as it will get us out of the > loop anyway. But for the -EIO case, I think we'll get -EBUSY from the > disable and stay within the loop endlessly? > >> >> >>> * -EBUSY -> we expect an interrupt (or a timeout), the loop looks fine >>> for that >>> * 0 -> the one thing that might happen is that we get an unsolicited >>> interrupt between the successful cancel_halt_clear and the disable; >>> not even giving up the lock here might even be better here? >> >> I didn't think of this case, but if cancel_halt_clear succeeds with 0 >> then we should wait, no? > > For 0 I don't expect a solicited interrupt (documentation for the > functions says that the subchannel is idle in that case); it's just the > unsolicited interrupt that might get into the way. > >> >>> >>> I think this loop will probably work as it is after this patch, but >>> giving up the lock when not really needed makes me a bit queasy... what >>> do others think? >>> >>>> wait_for_completion_timeout(&completion, 3*HZ); >>>> >>>> - private->completion = NULL; >>>> - flush_workqueue(vfio_ccw_work_q); >>>> - spin_lock_irq(sch->lock); >>>> - ret = cio_cancel_halt_clear(sch, &iretry); >>>> - }; >>>> - >>>> + private->completion = NULL; >> >> [1] flush_workqueue can go to sleep so we would still need to release >> spinlock and reacquire it again to try disabling the subchannel. > > Grr, I thought we could skip the flush in the !-EBUSY case, but I think > we can't due to the possibility of an unsolicited interrupt... what > simply adding handling for -EIO (although I'm not sure what we can > sensibly do in that case) and leave the other cases as they are now? > Thinking a little bit more about EIO, if the return code is EIO then it means we have exhausted all our options with cancel_halt_clear and the subchannel/device is still status pending, right? I think we should still continue to try and disable the subchannel, because if not then the subchannel/device could in some point of time come back and bite us. So we really should protect the system from this behavior. I think for EIO we should log an error message, but still try to continue with disabling the subchannel. What do you or others think? >> >>>> + flush_workqueue(vfio_ccw_work_q); >>>> + spin_lock_irq(sch->lock); >>>> ret = cio_disable_subchannel(sch); >>>> } while (ret == -EBUSY); >>>> out_unlock: >>> >>> >> > >
On Fri, 12 Apr 2019 10:38:50 -0400 Farhan Ali <alifm@linux.ibm.com> wrote: > On 04/12/2019 04:10 AM, Cornelia Huck wrote: > > On Thu, 11 Apr 2019 16:30:44 -0400 > > Farhan Ali <alifm@linux.ibm.com> wrote: > > > >> On 04/11/2019 12:24 PM, Cornelia Huck wrote: > >>> On Mon, 8 Apr 2019 17:05:32 -0400 > >>> Farhan Ali <alifm@linux.ibm.com> wrote: > >>> Looking at the possible return codes: > >>> * -ENODEV -> device is not operational anyway, in theory you should even > >>> not need to bother with disabling the subchannel > >>> * -EIO -> we've run out of retries, and the subchannel still is not > >>> idle; I'm not sure if we could do anything here, as disable is > >>> unlikely to work, either (...) > Thinking a little bit more about EIO, if the return code is EIO then it > means we have exhausted all our options with cancel_halt_clear and the > subchannel/device is still status pending, right? Yes. > > I think we should still continue to try and disable the subchannel, > because if not then the subchannel/device could in some point of time > come back and bite us. So we really should protect the system from this > behavior. I think trying to disable the subchannel does not really hurt, but I fear it won't succeed in that case... > > I think for EIO we should log an error message, but still try to > continue with disabling the subchannel. What do you or others think? Logging an error may be useful (it's really fouled up at that time), but... > > > > > >> > >>>> + flush_workqueue(vfio_ccw_work_q); > >>>> + spin_lock_irq(sch->lock); > >>>> ret = cio_disable_subchannel(sch); ...there's a good chance that we'd get -EBUSY here, which would keep us in the loop. We probably need to break out after we got -EIO from cancel_halt_clear, regardless of which return code we get from the disable. (It will be "interesting" to see what happens with such a stuck subchannel in the calling code; but I don't really see many options. Panic seems way too strong; maybe mark the subchannel as "broken; no idea how to fix"? But that would be a follow-on patch; I think if we avoid the endless loop here, this patch is a real improvement and should just go in.) > >>>> } while (ret == -EBUSY); > >>>> out_unlock: > >>> > >>> > >> > > > > >
On 04/15/2019 04:13 AM, Cornelia Huck wrote: > On Fri, 12 Apr 2019 10:38:50 -0400 > Farhan Ali <alifm@linux.ibm.com> wrote: > >> On 04/12/2019 04:10 AM, Cornelia Huck wrote: >>> On Thu, 11 Apr 2019 16:30:44 -0400 >>> Farhan Ali <alifm@linux.ibm.com> wrote: >>> >>>> On 04/11/2019 12:24 PM, Cornelia Huck wrote: >>>>> On Mon, 8 Apr 2019 17:05:32 -0400 >>>>> Farhan Ali <alifm@linux.ibm.com> wrote: > >>>>> Looking at the possible return codes: >>>>> * -ENODEV -> device is not operational anyway, in theory you should even >>>>> not need to bother with disabling the subchannel >>>>> * -EIO -> we've run out of retries, and the subchannel still is not >>>>> idle; I'm not sure if we could do anything here, as disable is >>>>> unlikely to work, either > > (...) > >> Thinking a little bit more about EIO, if the return code is EIO then it >> means we have exhausted all our options with cancel_halt_clear and the >> subchannel/device is still status pending, right? > > Yes. > >> >> I think we should still continue to try and disable the subchannel, >> because if not then the subchannel/device could in some point of time >> come back and bite us. So we really should protect the system from this >> behavior. > > I think trying to disable the subchannel does not really hurt, but I > fear it won't succeed in that case... > >> >> I think for EIO we should log an error message, but still try to >> continue with disabling the subchannel. What do you or others think? > > Logging an error may be useful (it's really fouled up at that time), but... > >> >> >> >> >>>> >>>>>> + flush_workqueue(vfio_ccw_work_q); >>>>>> + spin_lock_irq(sch->lock); >>>>>> ret = cio_disable_subchannel(sch); > > ...there's a good chance that we'd get -EBUSY here, which would keep us > in the loop. We probably need to break out after we got -EIO from > cancel_halt_clear, regardless of which return code we get from the > disable. Okay, for EIO we can log an error message and break out of the loop. I will send a v3. Are you going to queue patch 1 or patch 3 soon? If you are then I will just send this patch separately. Thanks Farhan > > (It will be "interesting" to see what happens with such a stuck > subchannel in the calling code; but I don't really see many options. > Panic seems way too strong; maybe mark the subchannel as "broken; no > idea how to fix"? But that would be a follow-on patch; I think if we > avoid the endless loop here, this patch is a real improvement and > should just go in.) > >>>>>> } while (ret == -EBUSY); >>>>>> out_unlock: >>>>> >>>>> >>>> >>> >>> >> > >
On Mon, 15 Apr 2019 09:38:37 -0400 Farhan Ali <alifm@linux.ibm.com> wrote: > On 04/15/2019 04:13 AM, Cornelia Huck wrote: > > On Fri, 12 Apr 2019 10:38:50 -0400 > > Farhan Ali <alifm@linux.ibm.com> wrote: > > > >> On 04/12/2019 04:10 AM, Cornelia Huck wrote: > >>> On Thu, 11 Apr 2019 16:30:44 -0400 > >>> Farhan Ali <alifm@linux.ibm.com> wrote: > >>> > >>>> On 04/11/2019 12:24 PM, Cornelia Huck wrote: > >>>>> On Mon, 8 Apr 2019 17:05:32 -0400 > >>>>> Farhan Ali <alifm@linux.ibm.com> wrote: > > > >>>>> Looking at the possible return codes: > >>>>> * -ENODEV -> device is not operational anyway, in theory you should even > >>>>> not need to bother with disabling the subchannel > >>>>> * -EIO -> we've run out of retries, and the subchannel still is not > >>>>> idle; I'm not sure if we could do anything here, as disable is > >>>>> unlikely to work, either > > > > (...) > > > >> Thinking a little bit more about EIO, if the return code is EIO then it > >> means we have exhausted all our options with cancel_halt_clear and the > >> subchannel/device is still status pending, right? > > > > Yes. > > > >> > >> I think we should still continue to try and disable the subchannel, > >> because if not then the subchannel/device could in some point of time > >> come back and bite us. So we really should protect the system from this > >> behavior. > > > > I think trying to disable the subchannel does not really hurt, but I > > fear it won't succeed in that case... > > > >> > >> I think for EIO we should log an error message, but still try to > >> continue with disabling the subchannel. What do you or others think? > > > > Logging an error may be useful (it's really fouled up at that time), but... > > > >> > >> > >> > >> > >>>> > >>>>>> + flush_workqueue(vfio_ccw_work_q); > >>>>>> + spin_lock_irq(sch->lock); > >>>>>> ret = cio_disable_subchannel(sch); > > > > ...there's a good chance that we'd get -EBUSY here, which would keep us > > in the loop. We probably need to break out after we got -EIO from > > cancel_halt_clear, regardless of which return code we get from the > > disable. > > Okay, for EIO we can log an error message and break out of the loop. > > I will send a v3. Are you going to queue patch 1 or patch 3 soon? If you > are then I will just send this patch separately. Yes, please do send it separately. I'm currently testing patch 1 and 3 on top of my patchset, will queue either with or without the halt/clear patches proper, depending on how soon I get acks/r-bs (hint, hint :) > > Thanks > Farhan > > > > > (It will be "interesting" to see what happens with such a stuck > > subchannel in the calling code; but I don't really see many options. > > Panic seems way too strong; maybe mark the subchannel as "broken; no > > idea how to fix"? But that would be a follow-on patch; I think if we > > avoid the endless loop here, this patch is a real improvement and > > should just go in.) > > > >>>>>> } while (ret == -EBUSY); > >>>>>> out_unlock: > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > >
On 04/15/2019 10:18 AM, Cornelia Huck wrote: > On Mon, 15 Apr 2019 09:38:37 -0400 > Farhan Ali <alifm@linux.ibm.com> wrote: > >> On 04/15/2019 04:13 AM, Cornelia Huck wrote: >>> On Fri, 12 Apr 2019 10:38:50 -0400 >>> Farhan Ali <alifm@linux.ibm.com> wrote: >>> >>>> On 04/12/2019 04:10 AM, Cornelia Huck wrote: >>>>> On Thu, 11 Apr 2019 16:30:44 -0400 >>>>> Farhan Ali <alifm@linux.ibm.com> wrote: >>>>> >>>>>> On 04/11/2019 12:24 PM, Cornelia Huck wrote: >>>>>>> On Mon, 8 Apr 2019 17:05:32 -0400 >>>>>>> Farhan Ali <alifm@linux.ibm.com> wrote: >>> >>>>>>> Looking at the possible return codes: >>>>>>> * -ENODEV -> device is not operational anyway, in theory you should even >>>>>>> not need to bother with disabling the subchannel >>>>>>> * -EIO -> we've run out of retries, and the subchannel still is not >>>>>>> idle; I'm not sure if we could do anything here, as disable is >>>>>>> unlikely to work, either >>> >>> (...) >>> >>>> Thinking a little bit more about EIO, if the return code is EIO then it >>>> means we have exhausted all our options with cancel_halt_clear and the >>>> subchannel/device is still status pending, right? >>> >>> Yes. >>> >>>> >>>> I think we should still continue to try and disable the subchannel, >>>> because if not then the subchannel/device could in some point of time >>>> come back and bite us. So we really should protect the system from this >>>> behavior. >>> >>> I think trying to disable the subchannel does not really hurt, but I >>> fear it won't succeed in that case... >>> >>>> >>>> I think for EIO we should log an error message, but still try to >>>> continue with disabling the subchannel. What do you or others think? >>> >>> Logging an error may be useful (it's really fouled up at that time), but... >>> >>>> >>>> >>>> >>>> >>>>>> >>>>>>>> + flush_workqueue(vfio_ccw_work_q); >>>>>>>> + spin_lock_irq(sch->lock); >>>>>>>> ret = cio_disable_subchannel(sch); >>> >>> ...there's a good chance that we'd get -EBUSY here, which would keep us >>> in the loop. We probably need to break out after we got -EIO from >>> cancel_halt_clear, regardless of which return code we get from the >>> disable. >> >> Okay, for EIO we can log an error message and break out of the loop. >> >> I will send a v3. Are you going to queue patch 1 or patch 3 soon? If you >> are then I will just send this patch separately. > > Yes, please do send it separately. I'm currently testing patch 1 and 3 > on top of my patchset, will queue either with or without the halt/clear > patches proper, depending on how soon I get acks/r-bs (hint, hint :) I am going through your patches 4 and 6 and hopefully will get back to you by the end of the day :). > >> >> Thanks >> Farhan >> >>> >>> (It will be "interesting" to see what happens with such a stuck >>> subchannel in the calling code; but I don't really see many options. >>> Panic seems way too strong; maybe mark the subchannel as "broken; no >>> idea how to fix"? But that would be a follow-on patch; I think if we >>> avoid the endless loop here, this patch is a real improvement and >>> should just go in.) >>> >>>>>>>> } while (ret == -EBUSY); >>>>>>>> out_unlock: >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> >> > >
On Mon, 15 Apr 2019 10:24:53 -0400 Farhan Ali <alifm@linux.ibm.com> wrote: > On 04/15/2019 10:18 AM, Cornelia Huck wrote: > > Yes, please do send it separately. I'm currently testing patch 1 and 3 > > on top of my patchset, will queue either with or without the halt/clear > > patches proper, depending on how soon I get acks/r-bs (hint, hint :) > > I am going through your patches 4 and 6 and hopefully will get back to > you by the end of the day :). Awesome, thanks!
diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c index 5aca475..4405f2a 100644 --- a/drivers/s390/cio/vfio_ccw_drv.c +++ b/drivers/s390/cio/vfio_ccw_drv.c @@ -43,26 +43,23 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch) if (ret != -EBUSY) goto out_unlock; + iretry = 255; do { - iretry = 255; ret = cio_cancel_halt_clear(sch, &iretry); - while (ret == -EBUSY) { - /* - * Flush all I/O and wait for - * cancel/halt/clear completion. - */ - private->completion = &completion; - spin_unlock_irq(sch->lock); - + /* + * Flush all I/O and wait for + * cancel/halt/clear completion. + */ + private->completion = &completion; + spin_unlock_irq(sch->lock); + + if (ret == -EBUSY) wait_for_completion_timeout(&completion, 3*HZ); - private->completion = NULL; - flush_workqueue(vfio_ccw_work_q); - spin_lock_irq(sch->lock); - ret = cio_cancel_halt_clear(sch, &iretry); - }; - + private->completion = NULL; + flush_workqueue(vfio_ccw_work_q); + spin_lock_irq(sch->lock); ret = cio_disable_subchannel(sch); } while (ret == -EBUSY); out_unlock:
The quiesce function calls cio_cancel_halt_clear() and if we get an -EBUSY we go into a loop where we: - wait for any interrupts - flush all I/O in the workqueue - retry cio_cancel_halt_clear During the period where we are waiting for interrupts or flushing all I/O, the channel subsystem could have completed a halt/clear action and turned off the corresponding activity control bits in the subchannel status word. This means the next time we call cio_cancel_halt_clear(), we will again start by calling cancel subchannel and so we can be stuck between calling cancel and halt forever. Rather than calling cio_cancel_halt_clear() immediately after waiting, let's try to disable the subchannel. If we succeed in disabling the subchannel then we know nothing else can happen with the device. Suggested-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Farhan Ali <alifm@linux.ibm.com> --- drivers/s390/cio/vfio_ccw_drv.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-)