pci: Save and restore VFs as a part of a reset

Message ID	20140505212346.18767.34117.stgit@ahduyck-cp2.jf.intel.com (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@kernel.org> Subject: [PATCH] pci: Save and restore VFs as a part of a reset From: Alexander Duyck <alexander.h.duyck@intel.com> To: bhelgaas@google.com, linux-pci@vger.kernel.org Cc: linux-kernel@vger.kernel.org, jeffrey.t.kirsher@intel.com Date: Mon, 05 May 2014 14:25:17 -0700 Message-ID: <20140505212346.18767.34117.stgit@ahduyck-cp2.jf.intel.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Duyck, Alexander H May 5, 2014, 9:25 p.m. UTC

This fixes an issue I found in which triggering a reset via the PCI sysfs
reset while SR-IOV was enabled would leave the VFs in a state in which the
BME and MSI-X enable bits were all cleared.

To correct that I have added code so that the VF state is saved and restored
as a part of the PF save and restore state functions.  By doing this the VF
state is restored as well as the IOV state allowing the VFs to resume function
following a reset.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/pci/pci.c |    2 ++
 drivers/pci/pci.h |    5 +++++
 3 files changed, 53 insertions(+), 2 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas May 27, 2014, 10:22 p.m. UTC | #1

On Mon, May 05, 2014 at 02:25:17PM -0700, Alexander Duyck wrote:
> This fixes an issue I found in which triggering a reset via the PCI sysfs
> reset while SR-IOV was enabled would leave the VFs in a state in which the
> BME and MSI-X enable bits were all cleared.
> 
> To correct that I have added code so that the VF state is saved and restored
> as a part of the PF save and restore state functions.  By doing this the VF
> state is restored as well as the IOV state allowing the VFs to resume function
> following a reset.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/pci/pci.c |    2 ++
>  drivers/pci/pci.h |    5 +++++
>  3 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index de7a747..645ed71 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -521,13 +521,57 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>  }
>  
>  /**
> + * pci_save_iov_state - Save the state of the VF configurations
> + * @dev: the PCI device
> + */
> +int pci_save_iov_state(struct pci_dev *dev)
> +{
> +	struct pci_dev *vfdev = NULL;
> +	unsigned short dev_id;
> +
> +	/* only search if we are a PF */
> +	if (!dev->is_physfn)
> +		return 0;
> +
> +	/* retrieve VF device ID */
> +	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);

This is just an optimization to reduce the number of things you get
back from pci_get_device(), and hence reduce the number of times the
"vfdev->is_virtfn && (vfdev->physfn == dev)" condition is false, right?

> +	/* loop through all the VFs and save their state information */
> +	while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
> +		if (vfdev->is_virtfn && (vfdev->physfn == dev)) {
> +			int err = pci_save_state(vfdev);

It makes me uneasy to operate on another device (we're resetting A, and
here we save state for B).  I know B is dependent on A, since B is a VF
related to PF A, but what synchronization is there to serialize this
against any other save/restore operations that may be in progress by B's
driver or by a sysfs operation on B?

Is there anything in the reset path that pays attention to whether
resetting this PF will clobber VFs?  Do we care whether those VFs are in
use?  I assume they might be in use by guests?

> +			if (err)
> +				return err;
> +		}
> +	}

pci_get_device() acquires a reference on each device it returns, so this
strategy would require a pci_dev_put().

But I'm not really keen on pci_get_device() in the first place.  It works
by iterating over all PCI devices in the system, which seems like a
sledgehammer approach.  It *is* widely used, but mostly in quirk-type code
from which I avert my eyes.

Maybe you could do something based on pci_walk_bus()?  If you did that, I
think the PCI_SRIOV_VF_DID would become superfluous.

> +
> +	return 0;
> +}
> +
> +/**
>   * pci_restore_iov_state - restore the state of the IOV capability
>   * @dev: the PCI device
>   */
>  void pci_restore_iov_state(struct pci_dev *dev)
>  {
> -	if (dev->is_physfn)
> -		sriov_restore_state(dev);
> +	struct pci_dev *vfdev = NULL;
> +	unsigned short dev_id;
> +
> +	if (!dev->is_physfn)
> +		return;
> +
> +	sriov_restore_state(dev);
> +
> +	/* retrieve VF device ID */
> +	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
> +
> +	/* loop through all VFs and restore state for any of our VFs */
> +	while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
> +		if (vfdev->is_virtfn && (vfdev->physfn == dev))
> +			pci_restore_state(vfdev);
> +	}
> +
> +	return 0;
>  }
>  
>  /**
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 7325d43..120a432 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1015,6 +1015,8 @@ pci_save_state(struct pci_dev *dev)
>  		return i;
>  	if ((i = pci_save_vc_state(dev)) != 0)
>  		return i;
> +	if ((i = pci_save_iov_state(dev)) != 0)
> +		return i;
>  	return 0;
>  }
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 6bd0822..9185edd 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -254,6 +254,7 @@ void pci_iov_release(struct pci_dev *dev);
>  int pci_iov_resource_bar(struct pci_dev *dev, int resno,
>  			 enum pci_bar_type *type);
>  resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
> +int pci_save_iov_state(struct pci_dev *dev);
>  void pci_restore_iov_state(struct pci_dev *dev);
>  int pci_iov_bus_range(struct pci_bus *bus);
>  
> @@ -271,6 +272,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
>  {
>  	return 0;
>  }
> +static inline void pci_save_iov_state(struct pci_dev *dev)
> +{
> +	return 0;
> +}
>  static inline void pci_restore_iov_state(struct pci_dev *dev)
>  {
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Duyck, Alexander H May 27, 2014, 11:53 p.m. UTC | #2

On 05/27/2014 03:22 PM, Bjorn Helgaas wrote:
> On Mon, May 05, 2014 at 02:25:17PM -0700, Alexander Duyck wrote:
>> This fixes an issue I found in which triggering a reset via the PCI sysfs
>> reset while SR-IOV was enabled would leave the VFs in a state in which the
>> BME and MSI-X enable bits were all cleared.
>>
>> To correct that I have added code so that the VF state is saved and restored
>> as a part of the PF save and restore state functions.  By doing this the VF
>> state is restored as well as the IOV state allowing the VFs to resume function
>> following a reset.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>>  drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>>  drivers/pci/pci.c |    2 ++
>>  drivers/pci/pci.h |    5 +++++
>>  3 files changed, 53 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index de7a747..645ed71 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -521,13 +521,57 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>>  }
>>  
>>  /**
>> + * pci_save_iov_state - Save the state of the VF configurations
>> + * @dev: the PCI device
>> + */
>> +int pci_save_iov_state(struct pci_dev *dev)
>> +{
>> +	struct pci_dev *vfdev = NULL;
>> +	unsigned short dev_id;
>> +
>> +	/* only search if we are a PF */
>> +	if (!dev->is_physfn)
>> +		return 0;
>> +
>> +	/* retrieve VF device ID */
>> +	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
> 
> This is just an optimization to reduce the number of things you get
> back from pci_get_device(), and hence reduce the number of times the
> "vfdev->is_virtfn && (vfdev->physfn == dev)" condition is false, right?
> 

Yes, this is just a mechanism to reduce the number of items we have to test.

>> +	/* loop through all the VFs and save their state information */
>> +	while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
>> +		if (vfdev->is_virtfn && (vfdev->physfn == dev)) {
>> +			int err = pci_save_state(vfdev);
> 
> It makes me uneasy to operate on another device (we're resetting A, and
> here we save state for B).  I know B is dependent on A, since B is a VF
> related to PF A, but what synchronization is there to serialize this
> against any other save/restore operations that may be in progress by B's
> driver or by a sysfs operation on B?

I don't believe there is any synchronization mechanism in place
currently.  I can look into that as well.  Odds are we probably need to
have the VFs check the parent lock before they take any independent action.

> 
> Is there anything in the reset path that pays attention to whether
> resetting this PF will clobber VFs?  Do we care whether those VFs are in
> use?  I assume they might be in use by guests?

The problem I found was that the sysfs reset call doesn't bother to
check with the PF driver at all.  It just clobbers the PF and any VFs on
it without talking to the PF driver.

> 
>> +			if (err)
>> +				return err;
>> +		}
>> +	}
> 
> pci_get_device() acquires a reference on each device it returns, so this
> strategy would require a pci_dev_put().

Yes, if I am not mistaken the pci_dev_put is called as a part of
pci_get_dev_by_id which is what pci_get_device ends up being.

> 
> But I'm not really keen on pci_get_device() in the first place.  It works
> by iterating over all PCI devices in the system, which seems like a
> sledgehammer approach.  It *is* widely used, but mostly in quirk-type code
> from which I avert my eyes.
> 
> Maybe you could do something based on pci_walk_bus()?  If you did that, I
> think the PCI_SRIOV_VF_DID would become superfluous.
> 

I can look into that, I'm not familiar with the interface.  I'll have to
see what the relationship is between the PF and VF in terms of busses as
I don't recall it off of the top of my head.

If it is simple enough I might submit a second patch to update the
pci_vfs_assigned since it is using similar logic.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas May 28, 2014, 1:19 a.m. UTC | #3

[+cc Alex, Don]

On Tue, May 27, 2014 at 5:53 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 05/27/2014 03:22 PM, Bjorn Helgaas wrote:
>> On Mon, May 05, 2014 at 02:25:17PM -0700, Alexander Duyck wrote:
>>> This fixes an issue I found in which triggering a reset via the PCI sysfs
>>> reset while SR-IOV was enabled would leave the VFs in a state in which the
>>> BME and MSI-X enable bits were all cleared.
>>>
>>> To correct that I have added code so that the VF state is saved and restored
>>> as a part of the PF save and restore state functions.  By doing this the VF
>>> state is restored as well as the IOV state allowing the VFs to resume function
>>> following a reset.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> ---
>>>  drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>>>  drivers/pci/pci.c |    2 ++
>>>  drivers/pci/pci.h |    5 +++++
>>>  3 files changed, 53 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>>> index de7a747..645ed71 100644
>>> --- a/drivers/pci/iov.c
>>> +++ b/drivers/pci/iov.c
>>> @@ -521,13 +521,57 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>>>  }
>>>
>>>  /**
>>> + * pci_save_iov_state - Save the state of the VF configurations
>>> + * @dev: the PCI device
>>> + */
>>> +int pci_save_iov_state(struct pci_dev *dev)
>>> +{
>>> +    struct pci_dev *vfdev = NULL;
>>> +    unsigned short dev_id;
>>> +
>>> +    /* only search if we are a PF */
>>> +    if (!dev->is_physfn)
>>> +            return 0;
>>> +
>>> +    /* retrieve VF device ID */
>>> +    pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
...

>>> +    /* loop through all the VFs and save their state information */
>>> +    while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
>>> +            if (vfdev->is_virtfn && (vfdev->physfn == dev)) {
>>> +                    int err = pci_save_state(vfdev);
>>
>> It makes me uneasy to operate on another device (we're resetting A, and
>> here we save state for B).  I know B is dependent on A, since B is a VF
>> related to PF A, but what synchronization is there to serialize this
>> against any other save/restore operations that may be in progress by B's
>> driver or by a sysfs operation on B?
>
> I don't believe there is any synchronization mechanism in place
> currently.  I can look into that as well.  Odds are we probably need to
> have the VFs check the parent lock before they take any independent action.

It's just the whole question of how we manage the single "saved-state"
area.  Right now, I think almost all use of it is under control of the
driver that owns the device, in suspend/resume methods.  The
exceptions are the PM suspend/freeze/etc. routines in
pci/pci-driver.c, which I assume prevent the driver from running and
are therefore safe, and the reset path.  I don't know how the

>> Is there anything in the reset path that pays attention to whether
>> resetting this PF will clobber VFs?  Do we care whether those VFs are in
>> use?  I assume they might be in use by guests?
>
> The problem I found was that the sysfs reset call doesn't bother to
> check with the PF driver at all.  It just clobbers the PF and any VFs on
> it without talking to the PF driver.

There is Keith Busch's recent patch:
http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/hotplug&id=3ebe7f9f7e4a4fd1f6461ecd01ff2961317a483a
.  I dunno if that's useful to you or not.

And I'm not sure there's actually a requirement to *have* a PF driver.
 Obviously there has to be a way to enable the VFs, but once they're
enabled, it might be possible to keep using them via VF drivers even
without a PF driver in the picture.

Maybe resetting the PF should just fail if there's an active VF.  If
you need to reset the PF, you'd have to unbind the VFs first.

>>> +                    if (err)
>>> +                            return err;
>>> +            }
>>> +    }
>>
>> pci_get_device() acquires a reference on each device it returns, so this
>> strategy would require a pci_dev_put().
>
> Yes, if I am not mistaken the pci_dev_put is called as a part of
> pci_get_dev_by_id which is what pci_get_device ends up being.

Oh, yeah, you're right.  I forgot about that.  Since you call it in a
loop until you get NULL back, you're OK.  It's only when you stop
before you get NULL that you have to deal with the extra reference.

>> But I'm not really keen on pci_get_device() in the first place.  It works
>> by iterating over all PCI devices in the system, which seems like a
>> sledgehammer approach.  It *is* widely used, but mostly in quirk-type code
>> from which I avert my eyes.
>>
>> Maybe you could do something based on pci_walk_bus()?  If you did that, I
>> think the PCI_SRIOV_VF_DID would become superfluous.
>>
>
> I can look into that, I'm not familiar with the interface.  I'll have to
> see what the relationship is between the PF and VF in terms of busses as
> I don't recall it off of the top of my head.

This reminds me about an open problem: VFs can be on "virtual" buses,
which aren't really connected in the hierarchy, and I don't think we
have a nice way to iterate over them.  So probably pci_get_device() is
the best we can do now.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alex Williamson May 28, 2014, 4:12 a.m. UTC | #4

On Tue, 2014-05-27 at 19:19 -0600, Bjorn Helgaas wrote:
> [+cc Alex, Don]
> 
> On Tue, May 27, 2014 at 5:53 PM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
> > On 05/27/2014 03:22 PM, Bjorn Helgaas wrote:
> >> On Mon, May 05, 2014 at 02:25:17PM -0700, Alexander Duyck wrote:
> >>> This fixes an issue I found in which triggering a reset via the PCI sysfs
> >>> reset while SR-IOV was enabled would leave the VFs in a state in which the
> >>> BME and MSI-X enable bits were all cleared.
> >>>
> >>> To correct that I have added code so that the VF state is saved and restored
> >>> as a part of the PF save and restore state functions.  By doing this the VF
> >>> state is restored as well as the IOV state allowing the VFs to resume function
> >>> following a reset.
> >>>
> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >>> ---
> >>>  drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
> >>>  drivers/pci/pci.c |    2 ++
> >>>  drivers/pci/pci.h |    5 +++++
> >>>  3 files changed, 53 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> >>> index de7a747..645ed71 100644
> >>> --- a/drivers/pci/iov.c
> >>> +++ b/drivers/pci/iov.c
> >>> @@ -521,13 +521,57 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
> >>>  }
> >>>
> >>>  /**
> >>> + * pci_save_iov_state - Save the state of the VF configurations
> >>> + * @dev: the PCI device
> >>> + */
> >>> +int pci_save_iov_state(struct pci_dev *dev)
> >>> +{
> >>> +    struct pci_dev *vfdev = NULL;
> >>> +    unsigned short dev_id;
> >>> +
> >>> +    /* only search if we are a PF */
> >>> +    if (!dev->is_physfn)
> >>> +            return 0;
> >>> +
> >>> +    /* retrieve VF device ID */
> >>> +    pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
> ...
> 
> >>> +    /* loop through all the VFs and save their state information */
> >>> +    while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
> >>> +            if (vfdev->is_virtfn && (vfdev->physfn == dev)) {
> >>> +                    int err = pci_save_state(vfdev);
> >>
> >> It makes me uneasy to operate on another device (we're resetting A, and
> >> here we save state for B).  I know B is dependent on A, since B is a VF
> >> related to PF A, but what synchronization is there to serialize this
> >> against any other save/restore operations that may be in progress by B's
> >> driver or by a sysfs operation on B?
> >
> > I don't believe there is any synchronization mechanism in place
> > currently.  I can look into that as well.  Odds are we probably need to
> > have the VFs check the parent lock before they take any independent action.
> 
> It's just the whole question of how we manage the single "saved-state"
> area.  Right now, I think almost all use of it is under control of the
> driver that owns the device, in suspend/resume methods.  The
> exceptions are the PM suspend/freeze/etc. routines in
> pci/pci-driver.c, which I assume prevent the driver from running and
> are therefore safe, and the reset path.  I don't know how the

Makes me a little uneasy too, what happens to a transaction headed
to/from the VF while the PF is in a reset state?  I suspect not good
things.  OTOH, the reset interface and a good bit of pci-sysfs have
always been at-your-own-risk interfaces and this restores some bits that
might get us closer to it being survivable.

We do have a way for drivers to get a long-term save state that they can
keep on their own, pci_save_state(); pci_store_saved_state() along with
pci_load_saved_state(); pci_restore_state().  Both KVM and VFIO use this
for assigning a device so we can attempt to re-load the pre-assigned
saved state.

> >> Is there anything in the reset path that pays attention to whether
> >> resetting this PF will clobber VFs?  Do we care whether those VFs are in
> >> use?  I assume they might be in use by guests?
> >
> > The problem I found was that the sysfs reset call doesn't bother to
> > check with the PF driver at all.  It just clobbers the PF and any VFs on
> > it without talking to the PF driver.
> 
> There is Keith Busch's recent patch:
> http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/hotplug&id=3ebe7f9f7e4a4fd1f6461ecd01ff2961317a483a
> .  I dunno if that's useful to you or not.
> 
> And I'm not sure there's actually a requirement to *have* a PF driver.
>  Obviously there has to be a way to enable the VFs, but once they're
> enabled, it might be possible to keep using them via VF drivers even
> without a PF driver in the picture.
> 
> Maybe resetting the PF should just fail if there's an active VF.  If
> you need to reset the PF, you'd have to unbind the VFs first.

The use case is certainly questionable, personally I'm not going to
expect VFs to continue working after the PF is reset.  Driver binding
gets complicated, especially when KVM doesn't actually bind devices to
use them.  Hopefully we'll get that out of the tree some day though.  I
suppose we could -EBUSY the PF reset as long as VFs are enabled.

> >>> +                    if (err)
> >>> +                            return err;
> >>> +            }
> >>> +    }
> >>
> >> pci_get_device() acquires a reference on each device it returns, so this
> >> strategy would require a pci_dev_put().
> >
> > Yes, if I am not mistaken the pci_dev_put is called as a part of
> > pci_get_dev_by_id which is what pci_get_device ends up being.
> 
> Oh, yeah, you're right.  I forgot about that.  Since you call it in a
> loop until you get NULL back, you're OK.  It's only when you stop
> before you get NULL that you have to deal with the extra reference.
> 
> >> But I'm not really keen on pci_get_device() in the first place.  It works
> >> by iterating over all PCI devices in the system, which seems like a
> >> sledgehammer approach.  It *is* widely used, but mostly in quirk-type code
> >> from which I avert my eyes.
> >>
> >> Maybe you could do something based on pci_walk_bus()?  If you did that, I
> >> think the PCI_SRIOV_VF_DID would become superfluous.
> >>
> >
> > I can look into that, I'm not familiar with the interface.  I'll have to
> > see what the relationship is between the PF and VF in terms of busses as
> > I don't recall it off of the top of my head.
> 
> This reminds me about an open problem: VFs can be on "virtual" buses,
> which aren't really connected in the hierarchy, and I don't think we
> have a nice way to iterate over them.  So probably pci_get_device() is
> the best we can do now.

Yeah, those virtual buses don't have a bus->self, we just have to skip
to bus->parent->self.  pci_walk_bus() goes in the opposite direction,
but without an actual device hosting the bus, I don't see how it finds
it.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Duyck, Alexander H May 28, 2014, 4:39 p.m. UTC | #5

On 05/27/2014 09:12 PM, Alex Williamson wrote:
> On Tue, 2014-05-27 at 19:19 -0600, Bjorn Helgaas wrote:
>> [+cc Alex, Don]
>>
>> On Tue, May 27, 2014 at 5:53 PM, Alexander Duyck
>> <alexander.h.duyck@intel.com> wrote:
>>> On 05/27/2014 03:22 PM, Bjorn Helgaas wrote:
>>>> On Mon, May 05, 2014 at 02:25:17PM -0700, Alexander Duyck wrote:
>>>>> This fixes an issue I found in which triggering a reset via the PCI sysfs
>>>>> reset while SR-IOV was enabled would leave the VFs in a state in which the
>>>>> BME and MSI-X enable bits were all cleared.
>>>>>
>>>>> To correct that I have added code so that the VF state is saved and restored
>>>>> as a part of the PF save and restore state functions.  By doing this the VF
>>>>> state is restored as well as the IOV state allowing the VFs to resume function
>>>>> following a reset.
>>>>>
>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>> ---
>>>>>  drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>>>>>  drivers/pci/pci.c |    2 ++
>>>>>  drivers/pci/pci.h |    5 +++++
>>>>>  3 files changed, 53 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>>>>> index de7a747..645ed71 100644
>>>>> --- a/drivers/pci/iov.c
>>>>> +++ b/drivers/pci/iov.c
>>>>> @@ -521,13 +521,57 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>>>>>  }
>>>>>
>>>>>  /**
>>>>> + * pci_save_iov_state - Save the state of the VF configurations
>>>>> + * @dev: the PCI device
>>>>> + */
>>>>> +int pci_save_iov_state(struct pci_dev *dev)
>>>>> +{
>>>>> +    struct pci_dev *vfdev = NULL;
>>>>> +    unsigned short dev_id;
>>>>> +
>>>>> +    /* only search if we are a PF */
>>>>> +    if (!dev->is_physfn)
>>>>> +            return 0;
>>>>> +
>>>>> +    /* retrieve VF device ID */
>>>>> +    pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
>> ...
>>
>>>>> +    /* loop through all the VFs and save their state information */
>>>>> +    while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
>>>>> +            if (vfdev->is_virtfn && (vfdev->physfn == dev)) {
>>>>> +                    int err = pci_save_state(vfdev);
>>>>
>>>> It makes me uneasy to operate on another device (we're resetting A, and
>>>> here we save state for B).  I know B is dependent on A, since B is a VF
>>>> related to PF A, but what synchronization is there to serialize this
>>>> against any other save/restore operations that may be in progress by B's
>>>> driver or by a sysfs operation on B?
>>>
>>> I don't believe there is any synchronization mechanism in place
>>> currently.  I can look into that as well.  Odds are we probably need to
>>> have the VFs check the parent lock before they take any independent action.
>>
>> It's just the whole question of how we manage the single "saved-state"
>> area.  Right now, I think almost all use of it is under control of the
>> driver that owns the device, in suspend/resume methods.  The
>> exceptions are the PM suspend/freeze/etc. routines in
>> pci/pci-driver.c, which I assume prevent the driver from running and
>> are therefore safe, and the reset path.  I don't know how the
> 
> Makes me a little uneasy too, what happens to a transaction headed
> to/from the VF while the PF is in a reset state?  I suspect not good
> things.  OTOH, the reset interface and a good bit of pci-sysfs have
> always been at-your-own-risk interfaces and this restores some bits that
> might get us closer to it being survivable.
> 
> We do have a way for drivers to get a long-term save state that they can
> keep on their own, pci_save_state(); pci_store_saved_state() along with
> pci_load_saved_state(); pci_restore_state().  Both KVM and VFIO use this
> for assigning a device so we can attempt to re-load the pre-assigned
> saved state.

So it sounds like this patch would interfere with the functioning of KVM
and VFIO then.  So perhaps I should just not take the direct approach of
saving it myself then.  Perhaps it would be better to just use the
notifier provided below to notify the VFs that an event has occurred.

>>>> Is there anything in the reset path that pays attention to whether
>>>> resetting this PF will clobber VFs?  Do we care whether those VFs are in
>>>> use?  I assume they might be in use by guests?
>>>
>>> The problem I found was that the sysfs reset call doesn't bother to
>>> check with the PF driver at all.  It just clobbers the PF and any VFs on
>>> it without talking to the PF driver.
>>
>> There is Keith Busch's recent patch:
>> http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/hotplug&id=3ebe7f9f7e4a4fd1f6461ecd01ff2961317a483a
>> .  I dunno if that's useful to you or not.
>>
>> And I'm not sure there's actually a requirement to *have* a PF driver.
>>  Obviously there has to be a way to enable the VFs, but once they're
>> enabled, it might be possible to keep using them via VF drivers even
>> without a PF driver in the picture.
>>
>> Maybe resetting the PF should just fail if there's an active VF.  If
>> you need to reset the PF, you'd have to unbind the VFs first.
> 
> The use case is certainly questionable, personally I'm not going to
> expect VFs to continue working after the PF is reset.  Driver binding
> gets complicated, especially when KVM doesn't actually bind devices to
> use them.  Hopefully we'll get that out of the tree some day though.  I
> suppose we could -EBUSY the PF reset as long as VFs are enabled.

What I could do is go through and notify the VFs that they are about to
get hit by a reset.  What they do with that information would be up to them.

So if the VFs are loaded on the host I could then at least allow them to
recover by saving and restoring the config space within the driver
themselves.

>>>> But I'm not really keen on pci_get_device() in the first place.  It works
>>>> by iterating over all PCI devices in the system, which seems like a
>>>> sledgehammer approach.  It *is* widely used, but mostly in quirk-type code
>>>> from which I avert my eyes.
>>>>
>>>> Maybe you could do something based on pci_walk_bus()?  If you did that, I
>>>> think the PCI_SRIOV_VF_DID would become superfluous.
>>>>
>>>
>>> I can look into that, I'm not familiar with the interface.  I'll have to
>>> see what the relationship is between the PF and VF in terms of busses as
>>> I don't recall it off of the top of my head.
>>
>> This reminds me about an open problem: VFs can be on "virtual" buses,
>> which aren't really connected in the hierarchy, and I don't think we
>> have a nice way to iterate over them.  So probably pci_get_device() is
>> the best we can do now.
> 
> Yeah, those virtual buses don't have a bus->self, we just have to skip
> to bus->parent->self.  pci_walk_bus() goes in the opposite direction,
> but without an actual device hosting the bus, I don't see how it finds
> it.  Thanks,
> 
> Alex

It seems like we should be able to come up with something like
pci_walk_vbus() though or something similar.  All we would need to do is
search the VFs on the bus of the PF and all child busses to that bus if
I am not mistaken.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alex Williamson May 28, 2014, 5:03 p.m. UTC | #6

On Wed, 2014-05-28 at 09:39 -0700, Alexander Duyck wrote:
> On 05/27/2014 09:12 PM, Alex Williamson wrote:
> > On Tue, 2014-05-27 at 19:19 -0600, Bjorn Helgaas wrote:
> >> [+cc Alex, Don]
> >>
> >> On Tue, May 27, 2014 at 5:53 PM, Alexander Duyck
> >> <alexander.h.duyck@intel.com> wrote:
> >>> On 05/27/2014 03:22 PM, Bjorn Helgaas wrote:
> >>>> On Mon, May 05, 2014 at 02:25:17PM -0700, Alexander Duyck wrote:
> >>>>> This fixes an issue I found in which triggering a reset via the PCI sysfs
> >>>>> reset while SR-IOV was enabled would leave the VFs in a state in which the
> >>>>> BME and MSI-X enable bits were all cleared.
> >>>>>
> >>>>> To correct that I have added code so that the VF state is saved and restored
> >>>>> as a part of the PF save and restore state functions.  By doing this the VF
> >>>>> state is restored as well as the IOV state allowing the VFs to resume function
> >>>>> following a reset.
> >>>>>
> >>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >>>>> ---
> >>>>>  drivers/pci/iov.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
> >>>>>  drivers/pci/pci.c |    2 ++
> >>>>>  drivers/pci/pci.h |    5 +++++
> >>>>>  3 files changed, 53 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> >>>>> index de7a747..645ed71 100644
> >>>>> --- a/drivers/pci/iov.c
> >>>>> +++ b/drivers/pci/iov.c
> >>>>> @@ -521,13 +521,57 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
> >>>>>  }
> >>>>>
> >>>>>  /**
> >>>>> + * pci_save_iov_state - Save the state of the VF configurations
> >>>>> + * @dev: the PCI device
> >>>>> + */
> >>>>> +int pci_save_iov_state(struct pci_dev *dev)
> >>>>> +{
> >>>>> +    struct pci_dev *vfdev = NULL;
> >>>>> +    unsigned short dev_id;
> >>>>> +
> >>>>> +    /* only search if we are a PF */
> >>>>> +    if (!dev->is_physfn)
> >>>>> +            return 0;
> >>>>> +
> >>>>> +    /* retrieve VF device ID */
> >>>>> +    pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
> >> ...
> >>
> >>>>> +    /* loop through all the VFs and save their state information */
> >>>>> +    while ((vfdev = pci_get_device(dev->vendor, dev_id, vfdev))) {
> >>>>> +            if (vfdev->is_virtfn && (vfdev->physfn == dev)) {
> >>>>> +                    int err = pci_save_state(vfdev);
> >>>>
> >>>> It makes me uneasy to operate on another device (we're resetting A, and
> >>>> here we save state for B).  I know B is dependent on A, since B is a VF
> >>>> related to PF A, but what synchronization is there to serialize this
> >>>> against any other save/restore operations that may be in progress by B's
> >>>> driver or by a sysfs operation on B?
> >>>
> >>> I don't believe there is any synchronization mechanism in place
> >>> currently.  I can look into that as well.  Odds are we probably need to
> >>> have the VFs check the parent lock before they take any independent action.
> >>
> >> It's just the whole question of how we manage the single "saved-state"
> >> area.  Right now, I think almost all use of it is under control of the
> >> driver that owns the device, in suspend/resume methods.  The
> >> exceptions are the PM suspend/freeze/etc. routines in
> >> pci/pci-driver.c, which I assume prevent the driver from running and
> >> are therefore safe, and the reset path.  I don't know how the
> > 
> > Makes me a little uneasy too, what happens to a transaction headed
> > to/from the VF while the PF is in a reset state?  I suspect not good
> > things.  OTOH, the reset interface and a good bit of pci-sysfs have
> > always been at-your-own-risk interfaces and this restores some bits that
> > might get us closer to it being survivable.
> > 
> > We do have a way for drivers to get a long-term save state that they can
> > keep on their own, pci_save_state(); pci_store_saved_state() along with
> > pci_load_saved_state(); pci_restore_state().  Both KVM and VFIO use this
> > for assigning a device so we can attempt to re-load the pre-assigned
> > saved state.
> 
> So it sounds like this patch would interfere with the functioning of KVM
> and VFIO then.  So perhaps I should just not take the direct approach of
> saving it myself then.  Perhaps it would be better to just use the
> notifier provided below to notify the VFs that an event has occurred.

No, I don't think it interferes with KVM/VFIO any more than it does any
other VF driver.  KVM/VFIO will save off the pre-assigned device state
into their own buffer to be restored later, so use of the save buffer
here should be ok unless there's a direct race.  I think you're still
making the VF more likely to be usable after a PF reset, but maybe the
question is still whether we should allow it or pretend that we can
handle continued VF operation after a PF reset.

> >>>> Is there anything in the reset path that pays attention to whether
> >>>> resetting this PF will clobber VFs?  Do we care whether those VFs are in
> >>>> use?  I assume they might be in use by guests?
> >>>
> >>> The problem I found was that the sysfs reset call doesn't bother to
> >>> check with the PF driver at all.  It just clobbers the PF and any VFs on
> >>> it without talking to the PF driver.
> >>
> >> There is Keith Busch's recent patch:
> >> http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/hotplug&id=3ebe7f9f7e4a4fd1f6461ecd01ff2961317a483a
> >> .  I dunno if that's useful to you or not.
> >>
> >> And I'm not sure there's actually a requirement to *have* a PF driver.
> >>  Obviously there has to be a way to enable the VFs, but once they're
> >> enabled, it might be possible to keep using them via VF drivers even
> >> without a PF driver in the picture.
> >>
> >> Maybe resetting the PF should just fail if there's an active VF.  If
> >> you need to reset the PF, you'd have to unbind the VFs first.
> > 
> > The use case is certainly questionable, personally I'm not going to
> > expect VFs to continue working after the PF is reset.  Driver binding
> > gets complicated, especially when KVM doesn't actually bind devices to
> > use them.  Hopefully we'll get that out of the tree some day though.  I
> > suppose we could -EBUSY the PF reset as long as VFs are enabled.
> 
> What I could do is go through and notify the VFs that they are about to
> get hit by a reset.  What they do with that information would be up to them.
> 
> So if the VFs are loaded on the host I could then at least allow them to
> recover by saving and restoring the config space within the driver
> themselves.

Maybe if the notifier list had to return some affirmative result for the
PF reset to proceed.  Thanks,

Alex

> >>>> But I'm not really keen on pci_get_device() in the first place.  It works
> >>>> by iterating over all PCI devices in the system, which seems like a
> >>>> sledgehammer approach.  It *is* widely used, but mostly in quirk-type code
> >>>> from which I avert my eyes.
> >>>>
> >>>> Maybe you could do something based on pci_walk_bus()?  If you did that, I
> >>>> think the PCI_SRIOV_VF_DID would become superfluous.
> >>>>
> >>>
> >>> I can look into that, I'm not familiar with the interface.  I'll have to
> >>> see what the relationship is between the PF and VF in terms of busses as
> >>> I don't recall it off of the top of my head.
> >>
> >> This reminds me about an open problem: VFs can be on "virtual" buses,
> >> which aren't really connected in the hierarchy, and I don't think we
> >> have a nice way to iterate over them.  So probably pci_get_device() is
> >> the best we can do now.
> > 
> > Yeah, those virtual buses don't have a bus->self, we just have to skip
> > to bus->parent->self.  pci_walk_bus() goes in the opposite direction,
> > but without an actual device hosting the bus, I don't see how it finds
> > it.  Thanks,
> > 
> > Alex
> 
> It seems like we should be able to come up with something like
> pci_walk_vbus() though or something similar.  All we would need to do is
> search the VFs on the bus of the PF and all child busses to that bus if
> I am not mistaken.
> 
> Thanks,
> 
> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas May 28, 2014, 8:14 p.m. UTC | #7

On Wed, May 28, 2014 at 10:39 AM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 05/27/2014 09:12 PM, Alex Williamson wrote:
>> On Tue, 2014-05-27 at 19:19 -0600, Bjorn Helgaas wrote:

>>> Maybe resetting the PF should just fail if there's an active VF.  If
>>> you need to reset the PF, you'd have to unbind the VFs first.
>>
>> The use case is certainly questionable, personally I'm not going to
>> expect VFs to continue working after the PF is reset.  Driver binding
>> gets complicated, especially when KVM doesn't actually bind devices to
>> use them.  Hopefully we'll get that out of the tree some day though.  I
>> suppose we could -EBUSY the PF reset as long as VFs are enabled.
>
> What I could do is go through and notify the VFs that they are about to
> get hit by a reset.  What they do with that information would be up to them.
>
> So if the VFs are loaded on the host I could then at least allow them to
> recover by saving and restoring the config space within the driver
> themselves.

I really like the idea of punting by failing the PF reset if there are
any active VFs.  That's a really easy way of making sure we aren't
going to blow up any guests.  What problems would it cause if we went
this route?

>>> This reminds me about an open problem: VFs can be on "virtual" buses,
>>> which aren't really connected in the hierarchy, and I don't think we
>>> have a nice way to iterate over them.  So probably pci_get_device() is
>>> the best we can do now.
>>
>> Yeah, those virtual buses don't have a bus->self, we just have to skip
>> to bus->parent->self.  pci_walk_bus() goes in the opposite direction,
>> but without an actual device hosting the bus, I don't see how it finds
>> it.  Thanks,
>
> It seems like we should be able to come up with something like
> pci_walk_vbus() though or something similar.  All we would need to do is
> search the VFs on the bus of the PF and all child busses to that bus if
> I am not mistaken.

I don't think that's going to work because the virtual buses don't
appear as the child bus of anything.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Donald Dutile May 28, 2014, 8:34 p.m. UTC | #8

On 05/28/2014 04:14 PM, Bjorn Helgaas wrote:
> On Wed, May 28, 2014 at 10:39 AM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
>> On 05/27/2014 09:12 PM, Alex Williamson wrote:
>>> On Tue, 2014-05-27 at 19:19 -0600, Bjorn Helgaas wrote:
>
>>>> Maybe resetting the PF should just fail if there's an active VF.  If
>>>> you need to reset the PF, you'd have to unbind the VFs first.
>>>
>>> The use case is certainly questionable, personally I'm not going to
>>> expect VFs to continue working after the PF is reset.  Driver binding
>>> gets complicated, especially when KVM doesn't actually bind devices to
>>> use them.  Hopefully we'll get that out of the tree some day though.  I
>>> suppose we could -EBUSY the PF reset as long as VFs are enabled.
>>
>> What I could do is go through and notify the VFs that they are about to
>> get hit by a reset.  What they do with that information would be up to them.
>>
>> So if the VFs are loaded on the host I could then at least allow them to
>> recover by saving and restoring the config space within the driver
>> themselves.
>
> I really like the idea of punting by failing the PF reset if there are
> any active VFs.  That's a really easy way of making sure we aren't
> going to blow up any guests.  What problems would it cause if we went
> this route?
>
I think this is the safest route.  PF<->VF interaction isn't architected,
and resetting the PF with active VFs will probably hang a number of SRIOV
implementations, requiring a system-level reset to correct the compounded problem.

>>>> This reminds me about an open problem: VFs can be on "virtual" buses,
>>>> which aren't really connected in the hierarchy, and I don't think we
>>>> have a nice way to iterate over them.  So probably pci_get_device() is
>>>> the best we can do now.
>>>
>>> Yeah, those virtual buses don't have a bus->self, we just have to skip
>>> to bus->parent->self.  pci_walk_bus() goes in the opposite direction,
>>> but without an actual device hosting the bus, I don't see how it finds
>>> it.  Thanks,
>>
>> It seems like we should be able to come up with something like
>> pci_walk_vbus() though or something similar.  All we would need to do is
>> search the VFs on the bus of the PF and all child busses to that bus if
>> I am not mistaken.
>
> I don't think that's going to work because the virtual buses don't
> appear as the child bus of anything.
>
+1.

> Bjorn
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Duyck, Alexander H May 28, 2014, 10:23 p.m. UTC | #9

On 05/28/2014 01:34 PM, Don Dutile wrote:
> On 05/28/2014 04:14 PM, Bjorn Helgaas wrote:
>> On Wed, May 28, 2014 at 10:39 AM, Alexander Duyck
>> <alexander.h.duyck@intel.com> wrote:
>>> On 05/27/2014 09:12 PM, Alex Williamson wrote:
>>>> On Tue, 2014-05-27 at 19:19 -0600, Bjorn Helgaas wrote:
>>
>>>>> Maybe resetting the PF should just fail if there's an active VF.  If
>>>>> you need to reset the PF, you'd have to unbind the VFs first.
>>>>
>>>> The use case is certainly questionable, personally I'm not going to
>>>> expect VFs to continue working after the PF is reset.  Driver binding
>>>> gets complicated, especially when KVM doesn't actually bind devices to
>>>> use them.  Hopefully we'll get that out of the tree some day though.  I
>>>> suppose we could -EBUSY the PF reset as long as VFs are enabled.
>>>
>>> What I could do is go through and notify the VFs that they are about to
>>> get hit by a reset.  What they do with that information would be up
>>> to them.
>>>
>>> So if the VFs are loaded on the host I could then at least allow them to
>>> recover by saving and restoring the config space within the driver
>>> themselves.
>>
>> I really like the idea of punting by failing the PF reset if there are
>> any active VFs.  That's a really easy way of making sure we aren't
>> going to blow up any guests.  What problems would it cause if we went
>> this route?
>>
> I think this is the safest route.  PF<->VF interaction isn't architected,
> and resetting the PF with active VFs will probably hang a number of SRIOV
> implementations, requiring a system-level reset to correct the
> compounded problem.

Well it still might be worth while to allow a full PCIe reset in cases
where the hardware has gotten into a bad state.  It seems like it might
be worthwhile to update the newly added reset notifier to allow for the
device to indicate if it ready for a reset or not, with the default
being to return -ENOTTY if the function is not implemented.

> 
>>>>> This reminds me about an open problem: VFs can be on "virtual" buses,
>>>>> which aren't really connected in the hierarchy, and I don't think we
>>>>> have a nice way to iterate over them.  So probably pci_get_device() is
>>>>> the best we can do now.
>>>>
>>>> Yeah, those virtual buses don't have a bus->self, we just have to skip
>>>> to bus->parent->self.  pci_walk_bus() goes in the opposite direction,
>>>> but without an actual device hosting the bus, I don't see how it finds
>>>> it.  Thanks,
>>>
>>> It seems like we should be able to come up with something like
>>> pci_walk_vbus() though or something similar.  All we would need to do is
>>> search the VFs on the bus of the PF and all child busses to that bus if
>>> I am not mistaken.
>>
>> I don't think that's going to work because the virtual buses don't
>> appear as the child bus of anything.
>>
> +1.
> 

Maybe I don't understand something but I have a function that I am
already testing that seems to work for what I need.  Is there any reason
I couldn't use the bus->children list to navigate through the bus list
and get all of the children of a given bus?

I'll submit a couple patches for feedback on those bits.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

pci: Save and restore VFs as a part of a reset

Commit Message

Comments

Patch