diff mbox series

pci: Skip power-off reset when pending unplug

Message ID 164002480875.2328622.7843917444941101286.stgit@omen (mailing list archive)
State New, archived
Headers show
Series pci: Skip power-off reset when pending unplug | expand

Commit Message

Alex Williamson Dec. 20, 2021, 6:26 p.m. UTC
The below referenced commit introduced a change where devices under a
root port slot are reset in response to removing power to the slot.
This improves emulation relative to bare metal when the slot is powered
off, but introduces an unnecessary step when devices under that slot
are slated for removal.

In the case of an assigned device, there are mandatory delays
associated with many device reset mechanisms which can stall the hot
unplug operation.  Also, in cases where the unplug request is triggered
via a release operation of the host driver, internal device locking in
the host kernel may result in a failure of the device reset mechanism,
which generates unnecessary log warnings.

Skip the reset for devices that are slated for unplug.

Cc: qemu-stable@nongnu.org
Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci/pci.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Michael S. Tsirkin Dec. 20, 2021, 11:03 p.m. UTC | #1
On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:
> The below referenced commit introduced a change where devices under a
> root port slot are reset in response to removing power to the slot.
> This improves emulation relative to bare metal when the slot is powered
> off, but introduces an unnecessary step when devices under that slot
> are slated for removal.
> 
> In the case of an assigned device, there are mandatory delays
> associated with many device reset mechanisms which can stall the hot
> unplug operation.  Also, in cases where the unplug request is triggered
> via a release operation of the host driver, internal device locking in
> the host kernel may result in a failure of the device reset mechanism,
> which generates unnecessary log warnings.
> 
> Skip the reset for devices that are slated for unplug.
> 
> Cc: qemu-stable@nongnu.org
> Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

I am not sure this is safe. IIUC pending_deleted_event
is normally set after host admin requested device removal,
while the reset could be triggered by guest for its own reasons
such as suspend or driver reload.

Looking at this some more, I am not sure I understand the
issue completely.
We have:

    if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
        (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
        (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
        (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
        pcie_cap_slot_do_unplug(dev);
    }
    pcie_cap_update_power(dev);

so device unplug triggers first, reset follows and by that time
there should be no devices under the bus, if there are then
it's because guest did not clear the power indicator.


So I am not sure how to fix the assignment issues as I'm not sure how do
they trigger, but here is a wild idea: maybe it should support an API
for starting reset asynchronously, then if the following access is
trying to reset again that second reset can just be skipped, while any
other access will stall.





> ---
>  hw/pci/pci.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e5993c1ef52b..f594da410797 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2869,7 +2869,7 @@ void pci_set_power(PCIDevice *d, bool state)
>      memory_region_set_enabled(&d->bus_master_enable_region,
>                                (pci_get_word(d->config + PCI_COMMAND)
>                                 & PCI_COMMAND_MASTER) && d->has_power);
> -    if (!d->has_power) {
> +    if (!d->has_power && !d->qdev.pending_deleted_event) {
>          pci_device_reset(d);
>      }
>  }
>
Alex Williamson Dec. 21, 2021, 4:36 p.m. UTC | #2
On Mon, 20 Dec 2021 18:03:56 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:
> > The below referenced commit introduced a change where devices under a
> > root port slot are reset in response to removing power to the slot.
> > This improves emulation relative to bare metal when the slot is powered
> > off, but introduces an unnecessary step when devices under that slot
> > are slated for removal.
> > 
> > In the case of an assigned device, there are mandatory delays
> > associated with many device reset mechanisms which can stall the hot
> > unplug operation.  Also, in cases where the unplug request is triggered
> > via a release operation of the host driver, internal device locking in
> > the host kernel may result in a failure of the device reset mechanism,
> > which generates unnecessary log warnings.
> > 
> > Skip the reset for devices that are slated for unplug.
> > 
> > Cc: qemu-stable@nongnu.org
> > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> 
> I am not sure this is safe. IIUC pending_deleted_event
> is normally set after host admin requested device removal,
> while the reset could be triggered by guest for its own reasons
> such as suspend or driver reload.

Right, the case where I mention that we get the warning looks exactly
like the admin doing a device eject, it calls qdev_unplug().  I'm not
trying to prevent arbitrary guest resets of the device, in fact there
are cases where the guest really should be able to reset the device,
nested assignment in addition to the cases you mention.  Gerd noted
that this was an unintended side effect of the referenced patch to
reset device that are imminently being removed.

> Looking at this some more, I am not sure I understand the
> issue completely.
> We have:
> 
>     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
>         (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
>         (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
>         (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
>         pcie_cap_slot_do_unplug(dev);
>     }
>     pcie_cap_update_power(dev);
> 
> so device unplug triggers first, reset follows and by that time
> there should be no devices under the bus, if there are then
> it's because guest did not clear the power indicator.

Note that the unplug only triggers here if the Power Indicator Control
is OFF, I see writes to SLTCTL in the following order:

 01f1 - > 02f1 -> 06f1 -> 07f1

So PIC changes to BLINK, then PCC changes the slot to OFF (this
triggers the reset), then PIC changes to OFF triggering the unplug.

The unnecessary reset that occurs here is universal.  Should the unplug
be occurring when:

  (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON

?
 
> So I am not sure how to fix the assignment issues as I'm not sure how do
> they trigger, but here is a wild idea: maybe it should support an API
> for starting reset asynchronously, then if the following access is
> trying to reset again that second reset can just be skipped, while any
> other access will stall.

As above, there's not a concurrency problem, so I don't see how an
async API buys us anything.  It seems the ordering of the slot power
induced reset versus device unplug is not as you expected.  Can we fix
that?  Thanks,

Alex

> > ---
> >  hw/pci/pci.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index e5993c1ef52b..f594da410797 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2869,7 +2869,7 @@ void pci_set_power(PCIDevice *d, bool state)
> >      memory_region_set_enabled(&d->bus_master_enable_region,
> >                                (pci_get_word(d->config + PCI_COMMAND)
> >                                 & PCI_COMMAND_MASTER) && d->has_power);
> > -    if (!d->has_power) {
> > +    if (!d->has_power && !d->qdev.pending_deleted_event) {
> >          pci_device_reset(d);
> >      }
> >  }
> >   
>
Michael S. Tsirkin Dec. 21, 2021, 11:40 p.m. UTC | #3
On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:
> On Mon, 20 Dec 2021 18:03:56 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:
> > > The below referenced commit introduced a change where devices under a
> > > root port slot are reset in response to removing power to the slot.
> > > This improves emulation relative to bare metal when the slot is powered
> > > off, but introduces an unnecessary step when devices under that slot
> > > are slated for removal.
> > > 
> > > In the case of an assigned device, there are mandatory delays
> > > associated with many device reset mechanisms which can stall the hot
> > > unplug operation.  Also, in cases where the unplug request is triggered
> > > via a release operation of the host driver, internal device locking in
> > > the host kernel may result in a failure of the device reset mechanism,
> > > which generates unnecessary log warnings.
> > > 
> > > Skip the reset for devices that are slated for unplug.
> > > 
> > > Cc: qemu-stable@nongnu.org
> > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> > 
> > I am not sure this is safe. IIUC pending_deleted_event
> > is normally set after host admin requested device removal,
> > while the reset could be triggered by guest for its own reasons
> > such as suspend or driver reload.
> 
> Right, the case where I mention that we get the warning looks exactly
> like the admin doing a device eject, it calls qdev_unplug().  I'm not
> trying to prevent arbitrary guest resets of the device, in fact there
> are cases where the guest really should be able to reset the device,
> nested assignment in addition to the cases you mention.  Gerd noted
> that this was an unintended side effect of the referenced patch to
> reset device that are imminently being removed.
> 
> > Looking at this some more, I am not sure I understand the
> > issue completely.
> > We have:
> > 
> >     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> >         (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> >         (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> >         (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
> >         pcie_cap_slot_do_unplug(dev);
> >     }
> >     pcie_cap_update_power(dev);
> > 
> > so device unplug triggers first, reset follows and by that time
> > there should be no devices under the bus, if there are then
> > it's because guest did not clear the power indicator.
> 
> Note that the unplug only triggers here if the Power Indicator Control
> is OFF, I see writes to SLTCTL in the following order:
> 
>  01f1 - > 02f1 -> 06f1 -> 07f1
> 
> So PIC changes to BLINK, then PCC changes the slot to OFF (this
> triggers the reset), then PIC changes to OFF triggering the unplug.
> 
> The unnecessary reset that occurs here is universal.  Should the unplug
> be occurring when:
> 
>   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> 
> ?

well blinking generally means "do not remove yet".

> > So I am not sure how to fix the assignment issues as I'm not sure how do
> > they trigger, but here is a wild idea: maybe it should support an API
> > for starting reset asynchronously, then if the following access is
> > trying to reset again that second reset can just be skipped, while any
> > other access will stall.
> 
> As above, there's not a concurrency problem, so I don't see how an
> async API buys us anything.

Well unplug resets the device again, right? Why is that reset not
problematic and this one is?

>  It seems the ordering of the slot power
> induced reset versus device unplug is not as you expected.  Can we fix
> that?  Thanks,
> 
> Alex

Oh I means on the PIC write. That triggers the unplug without triggering
a reset. I was under the impression you are saying the same guest
write triggers both reset and unplug.
Since in this case it's two writes, I don't see how we
can tie ourselves to guest doing things in a specific order.
It can always change the order of things.


> 
> > > ---
> > >  hw/pci/pci.c |    2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > index e5993c1ef52b..f594da410797 100644
> > > --- a/hw/pci/pci.c
> > > +++ b/hw/pci/pci.c
> > > @@ -2869,7 +2869,7 @@ void pci_set_power(PCIDevice *d, bool state)
> > >      memory_region_set_enabled(&d->bus_master_enable_region,
> > >                                (pci_get_word(d->config + PCI_COMMAND)
> > >                                 & PCI_COMMAND_MASTER) && d->has_power);
> > > -    if (!d->has_power) {
> > > +    if (!d->has_power && !d->qdev.pending_deleted_event) {
> > >          pci_device_reset(d);
> > >      }
> > >  }
> > >   
> >
Alex Williamson Dec. 22, 2021, 7:08 p.m. UTC | #4
On Tue, 21 Dec 2021 18:40:09 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:
> > On Mon, 20 Dec 2021 18:03:56 -0500
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:  
> > > > The below referenced commit introduced a change where devices under a
> > > > root port slot are reset in response to removing power to the slot.
> > > > This improves emulation relative to bare metal when the slot is powered
> > > > off, but introduces an unnecessary step when devices under that slot
> > > > are slated for removal.
> > > > 
> > > > In the case of an assigned device, there are mandatory delays
> > > > associated with many device reset mechanisms which can stall the hot
> > > > unplug operation.  Also, in cases where the unplug request is triggered
> > > > via a release operation of the host driver, internal device locking in
> > > > the host kernel may result in a failure of the device reset mechanism,
> > > > which generates unnecessary log warnings.
> > > > 
> > > > Skip the reset for devices that are slated for unplug.
> > > > 
> > > > Cc: qemu-stable@nongnu.org
> > > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>    
> > > 
> > > I am not sure this is safe. IIUC pending_deleted_event
> > > is normally set after host admin requested device removal,
> > > while the reset could be triggered by guest for its own reasons
> > > such as suspend or driver reload.  
> > 
> > Right, the case where I mention that we get the warning looks exactly
> > like the admin doing a device eject, it calls qdev_unplug().  I'm not
> > trying to prevent arbitrary guest resets of the device, in fact there
> > are cases where the guest really should be able to reset the device,
> > nested assignment in addition to the cases you mention.  Gerd noted
> > that this was an unintended side effect of the referenced patch to
> > reset device that are imminently being removed.
> >   
> > > Looking at this some more, I am not sure I understand the
> > > issue completely.
> > > We have:
> > > 
> > >     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > >         (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> > >         (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> > >         (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
> > >         pcie_cap_slot_do_unplug(dev);
> > >     }
> > >     pcie_cap_update_power(dev);
> > > 
> > > so device unplug triggers first, reset follows and by that time
> > > there should be no devices under the bus, if there are then
> > > it's because guest did not clear the power indicator.  
> > 
> > Note that the unplug only triggers here if the Power Indicator Control
> > is OFF, I see writes to SLTCTL in the following order:
> > 
> >  01f1 - > 02f1 -> 06f1 -> 07f1
> > 
> > So PIC changes to BLINK, then PCC changes the slot to OFF (this
> > triggers the reset), then PIC changes to OFF triggering the unplug.
> > 
> > The unnecessary reset that occurs here is universal.  Should the unplug
> > be occurring when:
> > 
> >   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> > 
> > ?  
> 
> well blinking generally means "do not remove yet".

Blinking indicates that the slot is in a transition phase, which we
could also interpret to mean that power has been removed and this is
the time required for the power to settle.  By that token, it might be
reasonable that a power state induced reset doesn't actually occur
until the slot reaches both the slot power off and power indicator off
state.  In that case we could reorganize things to let the unplug occur
before the power transition.  Of course the original proposal also
essentially supports this interpretation, the slot power off reset does
not occur for devices with a pending unplug and those devices are
removed after the slot transition grace period.

> > > So I am not sure how to fix the assignment issues as I'm not sure how do
> > > they trigger, but here is a wild idea: maybe it should support an API
> > > for starting reset asynchronously, then if the following access is
> > > trying to reset again that second reset can just be skipped, while any
> > > other access will stall.  
> > 
> > As above, there's not a concurrency problem, so I don't see how an
> > async API buys us anything.  
> 
> Well unplug resets the device again, right? Why is that reset not
> problematic and this one is?

It has the same issue, but there's no log message generated that
worries QE into marking this as a regression.  Obviously the ideal
outcome would be that we could reset the device under these conditions,
but to this point we've only managed to introduce "try" semantics to
the functions to prevent deadlock.  As this is a condition induced by
corner case admin device handling, we've so far considered the reset
failure acceptable.  Thanks,

Alex
Michael S. Tsirkin Dec. 22, 2021, 8:48 p.m. UTC | #5
On Wed, Dec 22, 2021 at 12:08:09PM -0700, Alex Williamson wrote:
> On Tue, 21 Dec 2021 18:40:09 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:
> > > On Mon, 20 Dec 2021 18:03:56 -0500
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:  
> > > > > The below referenced commit introduced a change where devices under a
> > > > > root port slot are reset in response to removing power to the slot.
> > > > > This improves emulation relative to bare metal when the slot is powered
> > > > > off, but introduces an unnecessary step when devices under that slot
> > > > > are slated for removal.
> > > > > 
> > > > > In the case of an assigned device, there are mandatory delays
> > > > > associated with many device reset mechanisms which can stall the hot
> > > > > unplug operation.  Also, in cases where the unplug request is triggered
> > > > > via a release operation of the host driver, internal device locking in
> > > > > the host kernel may result in a failure of the device reset mechanism,
> > > > > which generates unnecessary log warnings.
> > > > > 
> > > > > Skip the reset for devices that are slated for unplug.
> > > > > 
> > > > > Cc: qemu-stable@nongnu.org
> > > > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>    
> > > > 
> > > > I am not sure this is safe. IIUC pending_deleted_event
> > > > is normally set after host admin requested device removal,
> > > > while the reset could be triggered by guest for its own reasons
> > > > such as suspend or driver reload.  
> > > 
> > > Right, the case where I mention that we get the warning looks exactly
> > > like the admin doing a device eject, it calls qdev_unplug().  I'm not
> > > trying to prevent arbitrary guest resets of the device, in fact there
> > > are cases where the guest really should be able to reset the device,
> > > nested assignment in addition to the cases you mention.  Gerd noted
> > > that this was an unintended side effect of the referenced patch to
> > > reset device that are imminently being removed.
> > >   
> > > > Looking at this some more, I am not sure I understand the
> > > > issue completely.
> > > > We have:
> > > > 
> > > >     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > > >         (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> > > >         (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> > > >         (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
> > > >         pcie_cap_slot_do_unplug(dev);
> > > >     }
> > > >     pcie_cap_update_power(dev);
> > > > 
> > > > so device unplug triggers first, reset follows and by that time
> > > > there should be no devices under the bus, if there are then
> > > > it's because guest did not clear the power indicator.  
> > > 
> > > Note that the unplug only triggers here if the Power Indicator Control
> > > is OFF, I see writes to SLTCTL in the following order:
> > > 
> > >  01f1 - > 02f1 -> 06f1 -> 07f1
> > > 
> > > So PIC changes to BLINK, then PCC changes the slot to OFF (this
> > > triggers the reset), then PIC changes to OFF triggering the unplug.
> > > 
> > > The unnecessary reset that occurs here is universal.  Should the unplug
> > > be occurring when:
> > > 
> > >   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> > > 
> > > ?  
> > 
> > well blinking generally means "do not remove yet".
> 
> Blinking indicates that the slot is in a transition phase,

Well the spec seems to state that blinking indicates it's waiting
to see user does not change his/her mind by pressing the
button again.

> which we
> could also interpret to mean that power has been removed and this is
> the time required for the power to settle.  By that token, it might be
> reasonable that a power state induced reset doesn't actually occur
> until the slot reaches both the slot power off and power indicator off
> state.

The reset is actually just an attempt to approximate power off.
So I'm not sure that is right powering device off and then on
is just a slow but reasonable way for guests to reset a device.



>  In that case we could reorganize things to let the unplug occur
> before the power transition.

Hmm you mean unplug on host immediately when it starts blinking?
But drivers are not notified at this point, are they?

>  Of course the original proposal also
> essentially supports this interpretation, the slot power off reset does
> not occur for devices with a pending unplug and those devices are
> removed after the slot transition grace period.

Meaning the patch you posted? It relies on guest doing a specific
thing though, and guest and host states are not synchronized.


I think it might work to defer reset if it's blinking until it actually
stops blinking. To me it seems a bit less risky but but again, in theory
some guest driver could use the power cycle reset while hotplug plays
with PIC waiting for the cancel button press.
E.g. I suspect your patch can be broken just by guest loading/unloading
driver in a loop while host also triggers plug/unplug.


> > > > So I am not sure how to fix the assignment issues as I'm not sure how do
> > > > they trigger, but here is a wild idea: maybe it should support an API
> > > > for starting reset asynchronously, then if the following access is
> > > > trying to reset again that second reset can just be skipped, while any
> > > > other access will stall.  
> > > 
> > > As above, there's not a concurrency problem, so I don't see how an
> > > async API buys us anything.  
> > 
> > Well unplug resets the device again, right? Why is that reset not
> > problematic and this one is?
> 
> It has the same issue, but there's no log message generated that
> worries QE into marking this as a regression.

Well is the device already stopped from working at this point?
Prevented from getting and responding to guest accesses?
By something else?
Because this is what happens when it's powered off, isn't it?

>  Obviously the ideal
> outcome would be that we could reset the device under these conditions,
> but to this point we've only managed to introduce "try" semantics to
> the functions to prevent deadlock.  As this is a condition induced by
> corner case admin device handling, we've so far considered the reset
> failure acceptable.  Thanks,
> 
> Alex
Alex Williamson Dec. 22, 2021, 11:10 p.m. UTC | #6
On Wed, 22 Dec 2021 15:48:24 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Dec 22, 2021 at 12:08:09PM -0700, Alex Williamson wrote:
> > On Tue, 21 Dec 2021 18:40:09 -0500
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:  
> > > > On Mon, 20 Dec 2021 18:03:56 -0500
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >     
> > > > > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:    
> > > > > > The below referenced commit introduced a change where devices under a
> > > > > > root port slot are reset in response to removing power to the slot.
> > > > > > This improves emulation relative to bare metal when the slot is powered
> > > > > > off, but introduces an unnecessary step when devices under that slot
> > > > > > are slated for removal.
> > > > > > 
> > > > > > In the case of an assigned device, there are mandatory delays
> > > > > > associated with many device reset mechanisms which can stall the hot
> > > > > > unplug operation.  Also, in cases where the unplug request is triggered
> > > > > > via a release operation of the host driver, internal device locking in
> > > > > > the host kernel may result in a failure of the device reset mechanism,
> > > > > > which generates unnecessary log warnings.
> > > > > > 
> > > > > > Skip the reset for devices that are slated for unplug.
> > > > > > 
> > > > > > Cc: qemu-stable@nongnu.org
> > > > > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>      
> > > > > 
> > > > > I am not sure this is safe. IIUC pending_deleted_event
> > > > > is normally set after host admin requested device removal,
> > > > > while the reset could be triggered by guest for its own reasons
> > > > > such as suspend or driver reload.    
> > > > 
> > > > Right, the case where I mention that we get the warning looks exactly
> > > > like the admin doing a device eject, it calls qdev_unplug().  I'm not
> > > > trying to prevent arbitrary guest resets of the device, in fact there
> > > > are cases where the guest really should be able to reset the device,
> > > > nested assignment in addition to the cases you mention.  Gerd noted
> > > > that this was an unintended side effect of the referenced patch to
> > > > reset device that are imminently being removed.
> > > >     
> > > > > Looking at this some more, I am not sure I understand the
> > > > > issue completely.
> > > > > We have:
> > > > > 
> > > > >     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > > > >         (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> > > > >         (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> > > > >         (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
> > > > >         pcie_cap_slot_do_unplug(dev);
> > > > >     }
> > > > >     pcie_cap_update_power(dev);
> > > > > 
> > > > > so device unplug triggers first, reset follows and by that time
> > > > > there should be no devices under the bus, if there are then
> > > > > it's because guest did not clear the power indicator.    
> > > > 
> > > > Note that the unplug only triggers here if the Power Indicator Control
> > > > is OFF, I see writes to SLTCTL in the following order:
> > > > 
> > > >  01f1 - > 02f1 -> 06f1 -> 07f1
> > > > 
> > > > So PIC changes to BLINK, then PCC changes the slot to OFF (this
> > > > triggers the reset), then PIC changes to OFF triggering the unplug.
> > > > 
> > > > The unnecessary reset that occurs here is universal.  Should the unplug
> > > > be occurring when:
> > > > 
> > > >   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> > > > 
> > > > ?    
> > > 
> > > well blinking generally means "do not remove yet".  
> > 
> > Blinking indicates that the slot is in a transition phase,  
> 
> Well the spec seems to state that blinking indicates it's waiting
> to see user does not change his/her mind by pressing the
> button again.

We're dealing with the Power Indicator, not the Attention Indicator
here.

> > which we
> > could also interpret to mean that power has been removed and this is
> > the time required for the power to settle.  By that token, it might be
> > reasonable that a power state induced reset doesn't actually occur
> > until the slot reaches both the slot power off and power indicator off
> > state.  
> 
> The reset is actually just an attempt to approximate power off.
> So I'm not sure that is right powering device off and then on
> is just a slow but reasonable way for guests to reset a device.

Agree, I don't have a problem with resetting devices in response to the
slot being powered off, just that it's pointless, and in some scenarios
causes us additional grief when it occurs when the device is being
removed anyway.

> >  In that case we could reorganize things to let the unplug occur
> > before the power transition.  
> 
> Hmm you mean unplug on host immediately when it starts blinking?
> But drivers are not notified at this point, are they?

I think this is confusing Attention Indicator and Power Indicator
again.  The write sequence I noted for the slot control register was as
follows:

    01f1 - > 02f1 -> 06f1 -> 07f1

 01f1:
   Attention Indicator: OFF
   Power Indicator: ON
   Power Controller: ON

 02f1:
   Power Indicator: ON -> BLINK

 06f1:
   Power Controller: ON -> OFF

 07f1:
   Power Indicator: BLINK -> OFF

The device reset currently occurs at 06f1, when the Power Controller
removes power to the slot.  The unplug doesn't occur until 07f1 when
the Power Indicator turns off.  On bare metal, the device power would
be in some indeterminate state between those two writes as the power
drains.  We've chosen to reset the device at the beginning of this
phase, where power is first removed (ie. instantaneous power drain),
but on a physical device it happens somewhere in between.  It therefore
seems valid that it could happen at the moment the Power Indicator
turns off such that we could precede the device reset with any
necessary unplug operations.

> >  Of course the original proposal also
> > essentially supports this interpretation, the slot power off reset does
> > not occur for devices with a pending unplug and those devices are
> > removed after the slot transition grace period.  
> 
> Meaning the patch you posted? It relies on guest doing a specific
> thing though, and guest and host states are not synchronized.

The proposed patch just means we won't reset the device in response to
slot power if an unplug is pending.  So sure, if it's true that a guest
playing with the Power Controller without using the Power Indicator to
reflect the slot state could skip a device reset, but is that valid
guest behavior?

> I think it might work to defer reset if it's blinking until it actually
> stops blinking. To me it seems a bit less risky but but again, in theory
> some guest driver could use the power cycle reset while hotplug plays
> with PIC waiting for the cancel button press.

That's essentially my previous suggestion above.  The Power Indicator
blinking tells us the slot power is in transition, the option to press
the attention button to abort has passed.  I understand the abort
window to be governed by the Attention Indicator blinking.

> E.g. I suspect your patch can be broken just by guest loading/unloading
> driver in a loop while host also triggers plug/unplug.

It's not clear to me why that might fail.  Can you elaborate?  All it
does is skip the reset when an unplug is pending, but the actual unplug
finalizes the device and any subsequent plug necessarily uses a new
device, so I don't see what goes wrong.

> > > > > So I am not sure how to fix the assignment issues as I'm not sure how do
> > > > > they trigger, but here is a wild idea: maybe it should support an API
> > > > > for starting reset asynchronously, then if the following access is
> > > > > trying to reset again that second reset can just be skipped, while any
> > > > > other access will stall.    
> > > > 
> > > > As above, there's not a concurrency problem, so I don't see how an
> > > > async API buys us anything.    
> > > 
> > > Well unplug resets the device again, right? Why is that reset not
> > > problematic and this one is?  
> > 
> > It has the same issue, but there's no log message generated that
> > worries QE into marking this as a regression.  
> 
> Well is the device already stopped from working at this point?
> Prevented from getting and responding to guest accesses?
> By something else?
> Because this is what happens when it's powered off, isn't it?

The reset failure doesn't prevent the device from being unplugged, QEMU
releases it and it gets disabled by the kernel driver.  Thanks,

Alex

PS - I'm about to sign off for the year, so apologies that this
conversation will need to wait until next year for further follow-up on
my end.  Happy holidays!
Gerd Hoffmann Dec. 23, 2021, 7:11 a.m. UTC | #7
> > > > So I am not sure how to fix the assignment issues as I'm not sure how do
> > > > they trigger, but here is a wild idea: maybe it should support an API
> > > > for starting reset asynchronously, then if the following access is
> > > > trying to reset again that second reset can just be skipped, while any
> > > > other access will stall.  
> > > 
> > > As above, there's not a concurrency problem, so I don't see how an
> > > async API buys us anything.  
> > 
> > Well unplug resets the device again, right? Why is that reset not
> > problematic and this one is?
> 
> It has the same issue, but there's no log message generated that
> worries QE into marking this as a regression.  Obviously the ideal
> outcome would be that we could reset the device under these conditions,
> but to this point we've only managed to introduce "try" semantics to
> the functions to prevent deadlock.  As this is a condition induced by
> corner case admin device handling, we've so far considered the reset
> failure acceptable.  Thanks,

Maybe it makes sense to move the check for pending removal into the vfio
reset handler then, and either skip logging an error or skip reset
altogether?

take care,
  Gerd
Michael S. Tsirkin Dec. 23, 2021, 1:33 p.m. UTC | #8
On Wed, Dec 22, 2021 at 04:10:07PM -0700, Alex Williamson wrote:
> On Wed, 22 Dec 2021 15:48:24 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Dec 22, 2021 at 12:08:09PM -0700, Alex Williamson wrote:
> > > On Tue, 21 Dec 2021 18:40:09 -0500
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Tue, Dec 21, 2021 at 09:36:56AM -0700, Alex Williamson wrote:  
> > > > > On Mon, 20 Dec 2021 18:03:56 -0500
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >     
> > > > > > On Mon, Dec 20, 2021 at 11:26:59AM -0700, Alex Williamson wrote:    
> > > > > > > The below referenced commit introduced a change where devices under a
> > > > > > > root port slot are reset in response to removing power to the slot.
> > > > > > > This improves emulation relative to bare metal when the slot is powered
> > > > > > > off, but introduces an unnecessary step when devices under that slot
> > > > > > > are slated for removal.
> > > > > > > 
> > > > > > > In the case of an assigned device, there are mandatory delays
> > > > > > > associated with many device reset mechanisms which can stall the hot
> > > > > > > unplug operation.  Also, in cases where the unplug request is triggered
> > > > > > > via a release operation of the host driver, internal device locking in
> > > > > > > the host kernel may result in a failure of the device reset mechanism,
> > > > > > > which generates unnecessary log warnings.
> > > > > > > 
> > > > > > > Skip the reset for devices that are slated for unplug.
> > > > > > > 
> > > > > > > Cc: qemu-stable@nongnu.org
> > > > > > > Fixes: d5daff7d3126 ("pcie: implement slot power control for pcie root ports")
> > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>      
> > > > > > 
> > > > > > I am not sure this is safe. IIUC pending_deleted_event
> > > > > > is normally set after host admin requested device removal,
> > > > > > while the reset could be triggered by guest for its own reasons
> > > > > > such as suspend or driver reload.    
> > > > > 
> > > > > Right, the case where I mention that we get the warning looks exactly
> > > > > like the admin doing a device eject, it calls qdev_unplug().  I'm not
> > > > > trying to prevent arbitrary guest resets of the device, in fact there
> > > > > are cases where the guest really should be able to reset the device,
> > > > > nested assignment in addition to the cases you mention.  Gerd noted
> > > > > that this was an unintended side effect of the referenced patch to
> > > > > reset device that are imminently being removed.
> > > > >     
> > > > > > Looking at this some more, I am not sure I understand the
> > > > > > issue completely.
> > > > > > We have:
> > > > > > 
> > > > > >     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> > > > > >         (val & PCI_EXP_SLTCTL_PIC_OFF) == PCI_EXP_SLTCTL_PIC_OFF &&
> > > > > >         (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
> > > > > >         (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
> > > > > >         pcie_cap_slot_do_unplug(dev);
> > > > > >     }
> > > > > >     pcie_cap_update_power(dev);
> > > > > > 
> > > > > > so device unplug triggers first, reset follows and by that time
> > > > > > there should be no devices under the bus, if there are then
> > > > > > it's because guest did not clear the power indicator.    
> > > > > 
> > > > > Note that the unplug only triggers here if the Power Indicator Control
> > > > > is OFF, I see writes to SLTCTL in the following order:
> > > > > 
> > > > >  01f1 - > 02f1 -> 06f1 -> 07f1
> > > > > 
> > > > > So PIC changes to BLINK, then PCC changes the slot to OFF (this
> > > > > triggers the reset), then PIC changes to OFF triggering the unplug.
> > > > > 
> > > > > The unnecessary reset that occurs here is universal.  Should the unplug
> > > > > be occurring when:
> > > > > 
> > > > >   (val & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_ON
> > > > > 
> > > > > ?    
> > > > 
> > > > well blinking generally means "do not remove yet".  
> > > 
> > > Blinking indicates that the slot is in a transition phase,  
> > 
> > Well the spec seems to state that blinking indicates it's waiting
> > to see user does not change his/her mind by pressing the
> > button again.
> 
> We're dealing with the Power Indicator, not the Attention Indicator
> here.

Let's make sure we are talking about the same here:


The Attention Indicator, which must be yellow or amber in color, indicates that an operational
problem exists or that the hot-plug slot is being identified so that a human operator can locate it
easily.

and

Attention Indicator Blinking
A blinking Attention Indicator indicates that system software is identifying this slot for a human
operator to find. This behavior is controlled by a user (for example, from a software user interface
or management tool).


On the other hand:

Power Indicator Off
The Power Indicator in the Off state indicates that insertion or removal of an the adapter is
permitted. Main power to the slot is off if required by the form factor. Note that, depending on the
form factor, other power/signals may remain on, even when main power is off and the Power
Indicator is off. In an example using the PCI Express card form factor, if the platform provides
Vaux to hot-plug slots and the MRL is closed, any signals switched by the MRL are connected to the
slot even when the Power Indicator is off. Signals switched by the MRL are disconnected when the
MRL is opened. System software must cause a slot’s Power Indicator to be turned off when the slot
is not powered and/or it is permissible to insert or remove an adapter. Refer to the appropriate
form factor specification for details.

Power Indicator On
The Power Indicator in the On state indicates that the hot-plug operation is complete and that main
power to the slot is On and that insertion or removal of the adapter is not permitted.

Power Indicator Blinking
A blinking Power Indicator indicates that the slot is powering up or powering down and that
insertion or removal of the adapter is not permitted.
The blinking Power Indicator also provides visual feedback to the operator when the Attention
Button is pressed or when hot-plug operation is initiated through the hot-plug software interface.




> > > which we
> > > could also interpret to mean that power has been removed and this is
> > > the time required for the power to settle.  By that token, it might be
> > > reasonable that a power state induced reset doesn't actually occur
> > > until the slot reaches both the slot power off and power indicator off
> > > state.  
> > 
> > The reset is actually just an attempt to approximate power off.
> > So I'm not sure that is right powering device off and then on
> > is just a slow but reasonable way for guests to reset a device.
> 
> Agree, I don't have a problem with resetting devices in response to the
> slot being powered off, just that it's pointless, and in some scenarios
> causes us additional grief when it occurs when the device is being
> removed anyway.
> 
> > >  In that case we could reorganize things to let the unplug occur
> > > before the power transition.  
> > 
> > Hmm you mean unplug on host immediately when it starts blinking?
> > But drivers are not notified at this point, are they?
> 
> I think this is confusing Attention Indicator and Power Indicator
> again.

Let's try to clear it up.

Here's text from the SHPC spec, pcie spec is less clear imho but
same idea IIUC.

The Power Indicator provides visual feedback to the human operator (if the system
software accepts the request initiated by the Attention Button) by blinking. Once the
Power Indicator begins blinking, a 5-second abort interval exists during which a second
depression of the Attention Button cancels the operation.

Attention Indicator is confusingly unrelated to the Attention Button.
Right?


>  The write sequence I noted for the slot control register was as
> follows:
> 
>     01f1 - > 02f1 -> 06f1 -> 07f1
> 
>  01f1:
>    Attention Indicator: OFF
>    Power Indicator: ON
>    Power Controller: ON
> 
>  02f1:
>    Power Indicator: ON -> BLINK
> 
>  06f1:
>    Power Controller: ON -> OFF
> 
>  07f1:
>    Power Indicator: BLINK -> OFF
> 
> The device reset currently occurs at 06f1, when the Power Controller
> removes power to the slot.  The unplug doesn't occur until 07f1 when
> the Power Indicator turns off.  On bare metal, the device power would
> be in some indeterminate state between those two writes as the power
> drains.  We've chosen to reset the device at the beginning of this
> phase, where power is first removed (ie. instantaneous power drain),
> but on a physical device it happens somewhere in between.

Yes, this is true I think. But I think on bare metal it's guaranteed to
happen within 1 second after power is removed, whatever the state of the
power indicator.
Also, Gerd attempted to add PV code that special cases KVM and
removes all the multi-second waiting from unplug path.
So I am not sure we should rely on the 1 second wait, either.

>  It therefore
> seems valid that it could happen at the moment the Power Indicator
> turns off such that we could precede the device reset with any
> necessary unplug operations.

However the power indicator is merely an indicator for the
human operator. My understanding is that driver that does not want to permit
removing the device can turn off power without turning off
the power indicator.


> > >  Of course the original proposal also
> > > essentially supports this interpretation, the slot power off reset does
> > > not occur for devices with a pending unplug and those devices are
> > > removed after the slot transition grace period.  
> > 
> > Meaning the patch you posted? It relies on guest doing a specific
> > thing though, and guest and host states are not synchronized.
> 
> The proposed patch just means we won't reset the device in response to
> slot power if an unplug is pending.  So sure, if it's true that a guest
> playing with the Power Controller without using the Power Indicator to
> reflect the slot state could skip a device reset, but is that valid
> guest behavior?

I'm not 100% sure:
The Power Indicator in the Off state indicates that insertion or removal of an the adapter is
permitted.

but also

	System software must cause a slot’s Power Indicator to be turned off when the slot
	is not powered and/or it is permissible to insert or remove an adapter.

this and/or confuses me, but I think the "or" here is simply misguided.
The SHPC spec from which the interface in pcie was inherited just says:

	When the Power Indicator is off, it means that main power to the slot is off and that
	insertion or removal of an add-in card is permitted.



> > I think it might work to defer reset if it's blinking until it actually
> > stops blinking. To me it seems a bit less risky but but again, in theory
> > some guest driver could use the power cycle reset while hotplug plays
> > with PIC waiting for the cancel button press.
> 
> That's essentially my previous suggestion above.  The Power Indicator
> blinking tells us the slot power is in transition, the option to press
> the attention button to abort has passed.  I understand the abort
> window to be governed by the Attention Indicator blinking.

what text in the spec gives you that impression?

> > E.g. I suspect your patch can be broken just by guest loading/unloading
> > driver in a loop while host also triggers plug/unplug.
> 
> It's not clear to me why that might fail.  Can you elaborate?  All it
> does is skip the reset when an unplug is pending, but the actual unplug
> finalizes the device and any subsequent plug necessarily uses a new
> device, so I don't see what goes wrong.

host wants to start unplug
meanwhile guest wants to attempt a reset (for its own reasons)
we skip the reset to device retains a bunch of state in
its registers, the guest attempts to drive it assuming
a reset device.

> 
> > > > > > So I am not sure how to fix the assignment issues as I'm not sure how do
> > > > > > they trigger, but here is a wild idea: maybe it should support an API
> > > > > > for starting reset asynchronously, then if the following access is
> > > > > > trying to reset again that second reset can just be skipped, while any
> > > > > > other access will stall.    
> > > > > 
> > > > > As above, there's not a concurrency problem, so I don't see how an
> > > > > async API buys us anything.    
> > > > 
> > > > Well unplug resets the device again, right? Why is that reset not
> > > > problematic and this one is?  
> > > 
> > > It has the same issue, but there's no log message generated that
> > > worries QE into marking this as a regression.  
> > 
> > Well is the device already stopped from working at this point?
> > Prevented from getting and responding to guest accesses?
> > By something else?
> > Because this is what happens when it's powered off, isn't it?
> 
> The reset failure doesn't prevent the device from being unplugged, QEMU
> releases it and it gets disabled by the kernel driver.  Thanks,

But with device retaining state, can it get into a bad state?
Can it also corrupt guest?

> Alex
> 
> PS - I'm about to sign off for the year, so apologies that this
> conversation will need to wait until next year for further follow-up on
> my end.  Happy holidays!

No big deal I think. You too!
Alex Williamson Jan. 5, 2022, 7:17 p.m. UTC | #9
On Thu, 23 Dec 2021 08:33:09 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Dec 22, 2021 at 04:10:07PM -0700, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 15:48:24 -0500
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Wed, Dec 22, 2021 at 12:08:09PM -0700, Alex Williamson wrote:  
> > > > On Tue, 21 Dec 2021 18:40:09 -0500
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > The reset is actually just an attempt to approximate power off.
> > > So I'm not sure that is right powering device off and then on
> > > is just a slow but reasonable way for guests to reset a device.  
> > 
> > Agree, I don't have a problem with resetting devices in response to the
> > slot being powered off, just that it's pointless, and in some scenarios
> > causes us additional grief when it occurs when the device is being
> > removed anyway.
> >   
> > > >  In that case we could reorganize things to let the unplug occur
> > > > before the power transition.    
> > > 
> > > Hmm you mean unplug on host immediately when it starts blinking?
> > > But drivers are not notified at this point, are they?  
> > 
> > I think this is confusing Attention Indicator and Power Indicator
> > again.  
> 
> Let's try to clear it up.
> 
> Here's text from the SHPC spec, pcie spec is less clear imho but
> same idea IIUC.
> 
> The Power Indicator provides visual feedback to the human operator (if the system
> software accepts the request initiated by the Attention Button) by blinking. Once the
> Power Indicator begins blinking, a 5-second abort interval exists during which a second
> depression of the Attention Button cancels the operation.
> 
> Attention Indicator is confusingly unrelated to the Attention Button.
> Right?

Yeah, I think that's where I was getting confused.  So a qdev_unplug()
results in "pushing" the attention button, the power indicator starts
flashing for 5s, during which an additional attention button press
could cancel the event.  After that 5s period and with the power
indicator still flashing, the power controller is set to off, followed
by the power indicator turning off.

> >  The write sequence I noted for the slot control register was as
> > follows:
> > 
> >     01f1 - > 02f1 -> 06f1 -> 07f1
> > 
> >  01f1:
> >    Attention Indicator: OFF
> >    Power Indicator: ON
> >    Power Controller: ON
> > 
> >  02f1:
> >    Power Indicator: ON -> BLINK
> > 
> >  06f1:
> >    Power Controller: ON -> OFF
> > 
> >  07f1:
> >    Power Indicator: BLINK -> OFF
> > 
> > The device reset currently occurs at 06f1, when the Power Controller
> > removes power to the slot.  The unplug doesn't occur until 07f1 when
> > the Power Indicator turns off.  On bare metal, the device power would
> > be in some indeterminate state between those two writes as the power
> > drains.  We've chosen to reset the device at the beginning of this
> > phase, where power is first removed (ie. instantaneous power drain),
> > but on a physical device it happens somewhere in between.  
> 
> Yes, this is true I think. But I think on bare metal it's guaranteed to
> happen within 1 second after power is removed, whatever the state of the
> power indicator.
> Also, Gerd attempted to add PV code that special cases KVM and
> removes all the multi-second waiting from unplug path.
> So I am not sure we should rely on the 1 second wait, either.

Right, if we don't reset when power is removed we're in a guessing game
of whether the guest is following our assumed transitions.

> >  It therefore
> > seems valid that it could happen at the moment the Power Indicator
> > turns off such that we could precede the device reset with any
> > necessary unplug operations.  
> 
> However the power indicator is merely an indicator for the
> human operator. My understanding is that driver that does not want to permit
> removing the device can turn off power without turning off
> the power indicator.

Yes, on bare metal there's likely some small window where the device
power state is indeterminate, but to take advantage of that we'd need
to do something like setup a 2s timer to reset the device, where that
timer gets canceled if the power indicator turns off in the meantime.
It's a lot of heuristics.

> > > >  Of course the original proposal also
> > > > essentially supports this interpretation, the slot power off reset does
> > > > not occur for devices with a pending unplug and those devices are
> > > > removed after the slot transition grace period.    
> > > 
> > > Meaning the patch you posted? It relies on guest doing a specific
> > > thing though, and guest and host states are not synchronized.  
> > 
> > The proposed patch just means we won't reset the device in response to
> > slot power if an unplug is pending.  So sure, if it's true that a guest
> > playing with the Power Controller without using the Power Indicator to
> > reflect the slot state could skip a device reset, but is that valid
> > guest behavior?  
> 
> I'm not 100% sure:
> The Power Indicator in the Off state indicates that insertion or removal of an the adapter is
> permitted.
> 
> but also
> 
> 	System software must cause a slot’s Power Indicator to be turned off when the slot
> 	is not powered and/or it is permissible to insert or remove an adapter.
> 
> this and/or confuses me, but I think the "or" here is simply misguided.
> The SHPC spec from which the interface in pcie was inherited just says:
> 
> 	When the Power Indicator is off, it means that main power to the slot is off and that
> 	insertion or removal of an add-in card is permitted.

I think the power indicator is clearly our guideline for when we're
allowed to insert or remove a device from the slot.  Other than
resetting when slot power is removed as we do now, the above timer hack
is the only solution I can think of that guarantees we eventually reset
the device after power is removed without relying on the power
indicator transition.  But I'm not sure it's worthwhile.

> > > I think it might work to defer reset if it's blinking until it actually
> > > stops blinking. To me it seems a bit less risky but but again, in theory
> > > some guest driver could use the power cycle reset while hotplug plays
> > > with PIC waiting for the cancel button press.  
> > 
> > That's essentially my previous suggestion above.  The Power Indicator
> > blinking tells us the slot power is in transition, the option to press
> > the attention button to abort has passed.  I understand the abort
> > window to be governed by the Attention Indicator blinking.  
> 
> what text in the spec gives you that impression?

My misunderstanding of the attention vs power indicators.

> > > E.g. I suspect your patch can be broken just by guest loading/unloading
> > > driver in a loop while host also triggers plug/unplug.  
> > 
> > It's not clear to me why that might fail.  Can you elaborate?  All it
> > does is skip the reset when an unplug is pending, but the actual unplug
> > finalizes the device and any subsequent plug necessarily uses a new
> > device, so I don't see what goes wrong.  
> 
> host wants to start unplug
> meanwhile guest wants to attempt a reset (for its own reasons)
> we skip the reset to device retains a bunch of state in
> its registers, the guest attempts to drive it assuming
> a reset device.

Under the condition of the kernel device lock being held, we can't
reset the device anyway.  But yes, we don't need to extend that
vfio-pci limitation to other devices.  Unless you're interested in
pursuing a timer based solution, which I guess would model a physical
system with super capacitors on the power rail that will drain
eventually, or instantly with the power indicator turning off, I think
the best we can do at the moment is to clean up the error reporting in
vfio-pci, report that the reset failed, but not some obscure error
about un-owned groups because we've fallen through to unexpected reset
methods.  A reasonable error message for this condition can be
considered a feature rather than a regression.  It's arguably correct
to try to reset the device here.  I'll post a different patch where we
clean up the vfio-pci error reporting for this case.  Thanks,

Alex
diff mbox series

Patch

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e5993c1ef52b..f594da410797 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2869,7 +2869,7 @@  void pci_set_power(PCIDevice *d, bool state)
     memory_region_set_enabled(&d->bus_master_enable_region,
                               (pci_get_word(d->config + PCI_COMMAND)
                                & PCI_COMMAND_MASTER) && d->has_power);
-    if (!d->has_power) {
+    if (!d->has_power && !d->qdev.pending_deleted_event) {
         pci_device_reset(d);
     }
 }