Message ID | 20180227190520.3273.79728.stgit@localhost.localdomain (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, 27 Feb 2018 11:06:54 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > From: Alexander Duyck <alexander.h.duyck@intel.com> > > This patch is meant to add support for SR-IOV on devices when the VFs are > not managed by the kernel. Examples of recent patches attempting to do this > include: It appears to enable sriov when the _pf_ is not managed by the kernel, but by "managed" we mean that either there is no pf driver or the pf driver doesn't provide an sriov_configure callback, intentionally or otherwise. > virto - https://patchwork.kernel.org/patch/10241225/ > pci-stub - https://patchwork.kernel.org/patch/10109935/ > vfio - https://patchwork.kernel.org/patch/10103353/ > uio - https://patchwork.kernel.org/patch/9974031/ So is the goal to get around the issues with enabling sriov on each of the above drivers by doing it under the covers or are you really just trying to enable sriov for a truly unmanage (no pf driver) case? For example, should a driver explicitly not wanting sriov enabled implement a dummy sriov_configure function? > Since this is quickly blowing up into a multi-driver problem it is probably > best to implement this solution in one spot. > > This patch is an attempt to do that. What we do with this patch is provide > a generic call to enable SR-IOV in the case that the PF driver is either > not present, or the PF driver doesn't support configuring SR-IOV. > > A new sysfs value called sriov_unmanaged_autoprobe has been added. This > value is used as the drivers_autoprobe setting of the VFs when they are > being managed by an external entity such as userspace or device firmware > instead of being managed by the kernel. Documentation/ABI/testing/sysfs-bus-pci update is missing. > One side effect of this change is that the sriov_drivers_autoprobe and > sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is > disabled. Attempts to update them when SR-IOV is in use will only update > the local value and will not update sriov->autoprobe. And we expect users to understand when sriov_drivers_autoprobe applies vs sriov_unmanaged_autoprobe, even though they're using the same interfaces to enable sriov? Are all combinations expected to work, ex. unmanaged sriov is enabled, a native pf driver loads, vfs work? Not only does it seems like there's opportunity to use this incorrectly, I think maybe it might be difficult to use correctly. > I based my patch set originally on the patch by Mark Rustad but there isn't > much left after going through and cleaning out the bits that were no longer > needed, and after incorporating the feedback from David Miller. > > I have included the authors of the original 4 patches above in the Cc here. > My hope is to get feedback and/or review on if this works for their use > cases. > > Cc: Mark Rustad <mark.d.rustad@intel.com> > Cc: Maximilian Heyne <mheyne@amazon.de> > Cc: Liang-Min Wang <liang-min.wang@intel.com> > Cc: David Woodhouse <dwmw@amazon.co.uk> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > --- > drivers/pci/iov.c | 27 +++++++++++++++++++- > drivers/pci/pci-driver.c | 2 + > drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- > drivers/pci/pci.h | 4 ++- > include/linux/pci.h | 1 + > 5 files changed, 86 insertions(+), 10 deletions(-) > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index 677924ae0350..7b8858bd8d03 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) > pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); > iov->pgsz = pgsz; > iov->self = dev; > + iov->autoprobe = true; > iov->drivers_autoprobe = true; > pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); > pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); > @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) > */ > void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) > { > - if (dev->is_physfn) > + if (dev->is_physfn) { > dev->sriov->drivers_autoprobe = auto_probe; > + if (!dev->sriov->num_VFs) > + dev->sriov->autoprobe = auto_probe; Why is dev->sriov->autoprobe set any time other than immediately prior to enabling VFs? > + } > } > > /** > @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) > EXPORT_SYMBOL_GPL(pci_disable_sriov); > > /** > + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV > + * @dev: the PCI device > + * @nr_virtfn: number of virtual functions to enable, 0 to disable > + * > + * Used to provide generic enable/disable SR-IOV option for devices > + * that do not manage the VFs generated by their driver, or have no > + * driver present. > + */ > +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) > +{ > + int rc = 0; > + > + if (!nr_virtfn) > + pci_disable_sriov(dev); > + else > + rc = pci_enable_sriov(dev, nr_virtfn); > + > + return rc ? rc : nr_virtfn; > +} > + > +/** > * pci_num_vf - return number of VFs associated with a PF device_release_driver > * @dev: the PCI device > * > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index 3bed6beda051..2cc68dff6130 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) > #ifdef CONFIG_PCI_IOV > static inline bool pci_device_can_probe(struct pci_dev *pdev) > { > - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); > + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); > } > #else > static inline bool pci_device_can_probe(struct pci_dev *pdev) > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > index eb6bee8724cc..e701b6dbc267 100644 > --- a/drivers/pci/pci-sysfs.c > +++ b/drivers/pci/pci-sysfs.c > @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, > struct device_attribute *attr, > const char *buf, size_t count) > { > + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); > struct pci_dev *pdev = to_pci_dev(dev); > int ret; > u16 num_vfs; > @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, > goto exit; > > /* is PF driver loaded w/callback */ > - if (!pdev->driver || !pdev->driver->sriov_configure) { > - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); > - ret = -ENOENT; > - goto exit; > - } > + if (pdev->driver && pdev->driver->sriov_configure) > + sriov_configure = pdev->driver->sriov_configure; > + else > + sriov_configure = pci_sriov_configure_unmanaged; So an unwitting user is now able to enable vfs, independent of the pf... the trouble being that they're probably going to expect them to work and the more common case is that they won't. For instance, what can you do with an igbvf when igb isn't managing the pf? Or what happens when vfio-pci owns the pf, sriov is enabled via the unmanaged interface, and the pf user driver segfaults and gets killed, causing vfio-pci to restore the pf state, including wiping the sriov config? I guess I don't understand how vfs can operate fully independent of the pf and why vfio-pci wouldn't just implement a dummy sriov_configure to avoid contending with such issues. > if (num_vfs == 0) { > /* disable VFs */ > - ret = pdev->driver->sriov_configure(pdev, 0); > + ret = sriov_configure(pdev, 0); > + > + /* > + * Fall back to drivers_autoprobe in case legacy driver > + * decides to enable SR-IOV on load. > + */ > + pdev->sriov->autoprobe = pdev->sriov->drivers_autoprobe; > goto exit; > } > > @@ -642,7 +648,14 @@ static ssize_t sriov_numvfs_store(struct device *dev, > goto exit; > } > > - ret = pdev->driver->sriov_configure(pdev, num_vfs); > + /* > + * Update autoprobe based on unmanaged_autoprobe settings if PF > + * driver is not managing the SR-IOV configuration for this device. > + */ > + if (!pdev->driver || !pdev->driver->sriov_configure) > + pdev->sriov->autoprobe = pdev->sriov->unmanaged_autoprobe; Not sure why this is asymmetric, shouldn't we have an else pdev->sriov->autoprobe = pdev->sriov->drivers_autoprobe; Seems error prone otherwise, apply the one we actually intend to take effect for the new vfs in one place, consistently. > + > + ret = sriov_configure(pdev, num_vfs); > if (ret < 0) > goto exit; > > @@ -705,7 +718,37 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev, > if (kstrtobool(buf, &drivers_autoprobe) < 0) > return -EINVAL; > > + device_lock(&pdev->dev); > + > pdev->sriov->drivers_autoprobe = drivers_autoprobe; > + if (!pdev->sriov->num_VFs) > + pdev->sriov->autoprobe = drivers_autoprobe; This is another one that seems inconsistent to me. > + > + device_unlock(&pdev->dev); > + > + return count; > +} > + > +static ssize_t sriov_unmanaged_autoprobe_show(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + > + return sprintf(buf, "%u\n", pdev->sriov->unmanaged_autoprobe); > +} > + > +static ssize_t sriov_unmanaged_autoprobe_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + bool unmanaged_autoprobe; > + > + if (kstrtobool(buf, &unmanaged_autoprobe) < 0) > + return -EINVAL; > + > + pdev->sriov->unmanaged_autoprobe = unmanaged_autoprobe; > > return count; > } > @@ -720,6 +763,10 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev, > static struct device_attribute sriov_drivers_autoprobe_attr = > __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), > sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store); > +static struct device_attribute sriov_unmanaged_autoprobe_attr = > + __ATTR(sriov_unmanaged_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), > + sriov_unmanaged_autoprobe_show, > + sriov_unmanaged_autoprobe_store); > #endif /* CONFIG_PCI_IOV */ > > static ssize_t driver_override_store(struct device *dev, > @@ -1789,6 +1836,7 @@ static umode_t pcie_dev_attrs_are_visible(struct kobject *kobj, > &sriov_stride_attr.attr, > &sriov_vf_device_attr.attr, > &sriov_drivers_autoprobe_attr.attr, > + &sriov_unmanaged_autoprobe_attr.attr, > NULL, > }; > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index fcd81911b127..b5f8b034f02d 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -272,7 +272,9 @@ struct pci_sriov { > struct pci_dev *dev; /* Lowest numbered PF */ > struct pci_dev *self; /* This PF */ > resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ > - bool drivers_autoprobe; /* Auto probing of VFs by driver */ > + bool autoprobe; /* Auto probing of VFs by VF driver */ > + bool drivers_autoprobe; /* "" managed by PF driver */ > + bool unmanaged_autoprobe; /* "" unmanaged by kernel */ > }; > > /* pci_dev priv_flags */ > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 024a1beda008..0d2b10aeadfa 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1947,6 +1947,7 @@ static inline void pci_mmcfg_late_init(void) { } > > int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); > void pci_disable_sriov(struct pci_dev *dev); > +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int num_vfs); > int pci_iov_add_virtfn(struct pci_dev *dev, int id); > void pci_iov_remove_virtfn(struct pci_dev *dev, int id); > int pci_num_vf(struct pci_dev *dev); >
On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > On Tue, 27 Feb 2018 11:06:54 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> From: Alexander Duyck <alexander.h.duyck@intel.com> >> >> This patch is meant to add support for SR-IOV on devices when the VFs are >> not managed by the kernel. Examples of recent patches attempting to do this >> include: > > It appears to enable sriov when the _pf_ is not managed by the > kernel, but by "managed" we mean that either there is no pf driver or > the pf driver doesn't provide an sriov_configure callback, > intentionally or otherwise. > >> virto - https://patchwork.kernel.org/patch/10241225/ >> pci-stub - https://patchwork.kernel.org/patch/10109935/ >> vfio - https://patchwork.kernel.org/patch/10103353/ >> uio - https://patchwork.kernel.org/patch/9974031/ > > So is the goal to get around the issues with enabling sriov on each of > the above drivers by doing it under the covers or are you really just > trying to enable sriov for a truly unmanage (no pf driver) case? For > example, should a driver explicitly not wanting sriov enabled implement > a dummy sriov_configure function? > >> Since this is quickly blowing up into a multi-driver problem it is probably >> best to implement this solution in one spot. >> >> This patch is an attempt to do that. What we do with this patch is provide >> a generic call to enable SR-IOV in the case that the PF driver is either >> not present, or the PF driver doesn't support configuring SR-IOV. >> >> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >> value is used as the drivers_autoprobe setting of the VFs when they are >> being managed by an external entity such as userspace or device firmware >> instead of being managed by the kernel. > > Documentation/ABI/testing/sysfs-bus-pci update is missing. I can make sure to update that in the next version. >> One side effect of this change is that the sriov_drivers_autoprobe and >> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >> disabled. Attempts to update them when SR-IOV is in use will only update >> the local value and will not update sriov->autoprobe. > > And we expect users to understand when sriov_drivers_autoprobe applies > vs sriov_unmanaged_autoprobe, even though they're using the same > interfaces to enable sriov? Are all combinations expected to work, ex. > unmanaged sriov is enabled, a native pf driver loads, vfs work? Not > only does it seems like there's opportunity to use this incorrectly, I > think maybe it might be difficult to use correctly. > >> I based my patch set originally on the patch by Mark Rustad but there isn't >> much left after going through and cleaning out the bits that were no longer >> needed, and after incorporating the feedback from David Miller. >> >> I have included the authors of the original 4 patches above in the Cc here. >> My hope is to get feedback and/or review on if this works for their use >> cases. >> >> Cc: Mark Rustad <mark.d.rustad@intel.com> >> Cc: Maximilian Heyne <mheyne@amazon.de> >> Cc: Liang-Min Wang <liang-min.wang@intel.com> >> Cc: David Woodhouse <dwmw@amazon.co.uk> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> --- >> drivers/pci/iov.c | 27 +++++++++++++++++++- >> drivers/pci/pci-driver.c | 2 + >> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >> drivers/pci/pci.h | 4 ++- >> include/linux/pci.h | 1 + >> 5 files changed, 86 insertions(+), 10 deletions(-) >> >> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >> index 677924ae0350..7b8858bd8d03 100644 >> --- a/drivers/pci/iov.c >> +++ b/drivers/pci/iov.c >> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >> iov->pgsz = pgsz; >> iov->self = dev; >> + iov->autoprobe = true; >> iov->drivers_autoprobe = true; >> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >> */ >> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >> { >> - if (dev->is_physfn) >> + if (dev->is_physfn) { >> dev->sriov->drivers_autoprobe = auto_probe; >> + if (!dev->sriov->num_VFs) >> + dev->sriov->autoprobe = auto_probe; > > Why is dev->sriov->autoprobe set any time other than immediately prior > to enabling VFs? My concern here was drivers that are still floating around with the old module parameter option for enabling SR-IOV. In the unlikely event that somebody was to use such a driver I wanted to make certain that the drivers_autoprobe state was pre-populated. >> + } >> } >> >> /** >> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >> EXPORT_SYMBOL_GPL(pci_disable_sriov); >> >> /** >> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >> + * @dev: the PCI device >> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >> + * >> + * Used to provide generic enable/disable SR-IOV option for devices >> + * that do not manage the VFs generated by their driver, or have no >> + * driver present. >> + */ >> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >> +{ >> + int rc = 0; >> + >> + if (!nr_virtfn) >> + pci_disable_sriov(dev); >> + else >> + rc = pci_enable_sriov(dev, nr_virtfn); >> + >> + return rc ? rc : nr_virtfn; >> +} >> + >> +/** >> * pci_num_vf - return number of VFs associated with a PF device_release_driver >> * @dev: the PCI device >> * >> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >> index 3bed6beda051..2cc68dff6130 100644 >> --- a/drivers/pci/pci-driver.c >> +++ b/drivers/pci/pci-driver.c >> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >> #ifdef CONFIG_PCI_IOV >> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> { >> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >> } >> #else >> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >> index eb6bee8724cc..e701b6dbc267 100644 >> --- a/drivers/pci/pci-sysfs.c >> +++ b/drivers/pci/pci-sysfs.c >> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> struct device_attribute *attr, >> const char *buf, size_t count) >> { >> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >> struct pci_dev *pdev = to_pci_dev(dev); >> int ret; >> u16 num_vfs; >> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> goto exit; >> >> /* is PF driver loaded w/callback */ >> - if (!pdev->driver || !pdev->driver->sriov_configure) { >> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >> - ret = -ENOENT; >> - goto exit; >> - } >> + if (pdev->driver && pdev->driver->sriov_configure) >> + sriov_configure = pdev->driver->sriov_configure; >> + else >> + sriov_configure = pci_sriov_configure_unmanaged; > > So an unwitting user is now able to enable vfs, independent of the > pf... the trouble being that they're probably going to expect them to > work and the more common case is that they won't. For instance, what > can you do with an igbvf when igb isn't managing the pf? Well the VFs wouldn't be able to do anything. Basically they would be sitting there with no driver loaded on them unless they are assigned to a guest, or the root user had enabled the unmanaged option. If you did load a driver on it the VF would sit there with link down unless either the PF driver is loaded or some user-space entity steps in to start managing the PF. In reality this can already happen as last I recall igb and ixgbe were already capable of having the PF driver removed when SR-IOV was enabled and VFs were assigned. Basically the VFs just report link down and don't do anything. Reloading the PF driver would have it take over in the case of igb and ixgbe since they were designed to handle that type of scenario. > Or what happens when vfio-pci owns the pf, sriov is enabled via the > unmanaged interface, and the pf user driver segfaults and gets killed, > causing vfio-pci to restore the pf state, including wiping the sriov > config? Wiping the config shouldn't have any effect on the allocated VF pci devices. It will cause the VFs to act as though they have fallen off of the bus though and the guests would see a surprise remove type behavior if I am not mistaken. The MMIO and config accesses would fail until the SR-IOV configuration is restored. Really this shouldn't be a problem as long as the SR-IOV is enabled prior to loading the vfio-pci driver as I would assume it would restore the state an re-enable SR-IOV. In the grand scheme of things how would the situation you describe be any different than someone using the "reset" sysfs control on the PF while using SR-IOV with driver supported SR-IOV? I suppose if you really wanted we could add a new call that you could put into the sriov_configure pointer that would just make it always return error. Then that way the configuration could be locked until the driver is unloaded. > I guess I don't understand how vfs can operate fully independent of the > pf and why vfio-pci wouldn't just implement a dummy sriov_configure to > avoid contending with such issues. This approach isn't without risk, but it isn't as if it is really a new risk we are introducing, and we could do further work to help smooth out issues like this. Much of this depends on the design of the device and the drivers involved. What is happening in this case is that the VFs are managed outside of the kernel itself either by some user-space entity or by a firmware entity. >> if (num_vfs == 0) { >> /* disable VFs */ >> - ret = pdev->driver->sriov_configure(pdev, 0); >> + ret = sriov_configure(pdev, 0); >> + >> + /* >> + * Fall back to drivers_autoprobe in case legacy driver >> + * decides to enable SR-IOV on load. >> + */ >> + pdev->sriov->autoprobe = pdev->sriov->drivers_autoprobe; >> goto exit; >> } >> >> @@ -642,7 +648,14 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> goto exit; >> } >> >> - ret = pdev->driver->sriov_configure(pdev, num_vfs); >> + /* >> + * Update autoprobe based on unmanaged_autoprobe settings if PF >> + * driver is not managing the SR-IOV configuration for this device. >> + */ >> + if (!pdev->driver || !pdev->driver->sriov_configure) >> + pdev->sriov->autoprobe = pdev->sriov->unmanaged_autoprobe; > > Not sure why this is asymmetric, shouldn't we have an > > else > pdev->sriov->autoprobe = pdev->sriov->drivers_autoprobe; > > Seems error prone otherwise, apply the one we actually intend to take > effect for the new vfs in one place, consistently. The default is the value in drivers_autoprobe. I did it that way in order to deal with any drivers that might still be using the module parameter to enable SR-IOV or that just don't bother with the sysfs. That is why I only update autoprobe when either we disable SR-IOV or we update drivers_autoprobe and num_VFs is 0. >> + >> + ret = sriov_configure(pdev, num_vfs); >> if (ret < 0) >> goto exit; >> >> @@ -705,7 +718,37 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev, >> if (kstrtobool(buf, &drivers_autoprobe) < 0) >> return -EINVAL; >> >> + device_lock(&pdev->dev); >> + >> pdev->sriov->drivers_autoprobe = drivers_autoprobe; >> + if (!pdev->sriov->num_VFs) >> + pdev->sriov->autoprobe = drivers_autoprobe; > > This is another one that seems inconsistent to me. See my note above. This is because there are callers to pci_enable_sriov() that do so for reasons other than using the syfs value. Specifically the legacy approach that uses the module parameter. >> + >> + device_unlock(&pdev->dev); >> + >> + return count; >> +} >> + >> +static ssize_t sriov_unmanaged_autoprobe_show(struct device *dev, >> + struct device_attribute *attr, >> + char *buf) >> +{ >> + struct pci_dev *pdev = to_pci_dev(dev); >> + >> + return sprintf(buf, "%u\n", pdev->sriov->unmanaged_autoprobe); >> +} >> + >> +static ssize_t sriov_unmanaged_autoprobe_store(struct device *dev, >> + struct device_attribute *attr, >> + const char *buf, size_t count) >> +{ >> + struct pci_dev *pdev = to_pci_dev(dev); >> + bool unmanaged_autoprobe; >> + >> + if (kstrtobool(buf, &unmanaged_autoprobe) < 0) >> + return -EINVAL; >> + >> + pdev->sriov->unmanaged_autoprobe = unmanaged_autoprobe; >> >> return count; >> } >> @@ -720,6 +763,10 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev, >> static struct device_attribute sriov_drivers_autoprobe_attr = >> __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), >> sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store); >> +static struct device_attribute sriov_unmanaged_autoprobe_attr = >> + __ATTR(sriov_unmanaged_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), >> + sriov_unmanaged_autoprobe_show, >> + sriov_unmanaged_autoprobe_store); >> #endif /* CONFIG_PCI_IOV */ >> >> static ssize_t driver_override_store(struct device *dev, >> @@ -1789,6 +1836,7 @@ static umode_t pcie_dev_attrs_are_visible(struct kobject *kobj, >> &sriov_stride_attr.attr, >> &sriov_vf_device_attr.attr, >> &sriov_drivers_autoprobe_attr.attr, >> + &sriov_unmanaged_autoprobe_attr.attr, >> NULL, >> }; >> >> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h >> index fcd81911b127..b5f8b034f02d 100644 >> --- a/drivers/pci/pci.h >> +++ b/drivers/pci/pci.h >> @@ -272,7 +272,9 @@ struct pci_sriov { >> struct pci_dev *dev; /* Lowest numbered PF */ >> struct pci_dev *self; /* This PF */ >> resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ >> - bool drivers_autoprobe; /* Auto probing of VFs by driver */ >> + bool autoprobe; /* Auto probing of VFs by VF driver */ >> + bool drivers_autoprobe; /* "" managed by PF driver */ >> + bool unmanaged_autoprobe; /* "" unmanaged by kernel */ >> }; >> >> /* pci_dev priv_flags */ >> diff --git a/include/linux/pci.h b/include/linux/pci.h >> index 024a1beda008..0d2b10aeadfa 100644 >> --- a/include/linux/pci.h >> +++ b/include/linux/pci.h >> @@ -1947,6 +1947,7 @@ static inline void pci_mmcfg_late_init(void) { } >> >> int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); >> void pci_disable_sriov(struct pci_dev *dev); >> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int num_vfs); >> int pci_iov_add_virtfn(struct pci_dev *dev, int id); >> void pci_iov_remove_virtfn(struct pci_dev *dev, int id); >> int pci_num_vf(struct pci_dev *dev); >> >
On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson > <alex.williamson@redhat.com> wrote: >> On Tue, 27 Feb 2018 11:06:54 -0800 >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >>> From: Alexander Duyck <alexander.h.duyck@intel.com> >>> >>> This patch is meant to add support for SR-IOV on devices when the VFs are >>> not managed by the kernel. Examples of recent patches attempting to do this >>> include: >> >> It appears to enable sriov when the _pf_ is not managed by the >> kernel, but by "managed" we mean that either there is no pf driver or >> the pf driver doesn't provide an sriov_configure callback, >> intentionally or otherwise. >> >>> virto - https://patchwork.kernel.org/patch/10241225/ >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >>> vfio - https://patchwork.kernel.org/patch/10103353/ >>> uio - https://patchwork.kernel.org/patch/9974031/ >> >> So is the goal to get around the issues with enabling sriov on each of >> the above drivers by doing it under the covers or are you really just >> trying to enable sriov for a truly unmanage (no pf driver) case? For >> example, should a driver explicitly not wanting sriov enabled implement >> a dummy sriov_configure function? >> >>> Since this is quickly blowing up into a multi-driver problem it is probably >>> best to implement this solution in one spot. >>> >>> This patch is an attempt to do that. What we do with this patch is provide >>> a generic call to enable SR-IOV in the case that the PF driver is either >>> not present, or the PF driver doesn't support configuring SR-IOV. >>> >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >>> value is used as the drivers_autoprobe setting of the VFs when they are >>> being managed by an external entity such as userspace or device firmware >>> instead of being managed by the kernel. >> >> Documentation/ABI/testing/sysfs-bus-pci update is missing. > > I can make sure to update that in the next version. > >>> One side effect of this change is that the sriov_drivers_autoprobe and >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >>> disabled. Attempts to update them when SR-IOV is in use will only update >>> the local value and will not update sriov->autoprobe. >> >> And we expect users to understand when sriov_drivers_autoprobe applies >> vs sriov_unmanaged_autoprobe, even though they're using the same >> interfaces to enable sriov? Are all combinations expected to work, ex. >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >> only does it seems like there's opportunity to use this incorrectly, I >> think maybe it might be difficult to use correctly. >> >>> I based my patch set originally on the patch by Mark Rustad but there isn't >>> much left after going through and cleaning out the bits that were no longer >>> needed, and after incorporating the feedback from David Miller. >>> >>> I have included the authors of the original 4 patches above in the Cc here. >>> My hope is to get feedback and/or review on if this works for their use >>> cases. >>> >>> Cc: Mark Rustad <mark.d.rustad@intel.com> >>> Cc: Maximilian Heyne <mheyne@amazon.de> >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >>> Cc: David Woodhouse <dwmw@amazon.co.uk> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>> --- >>> drivers/pci/iov.c | 27 +++++++++++++++++++- >>> drivers/pci/pci-driver.c | 2 + >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >>> drivers/pci/pci.h | 4 ++- >>> include/linux/pci.h | 1 + >>> 5 files changed, 86 insertions(+), 10 deletions(-) >>> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>> index 677924ae0350..7b8858bd8d03 100644 >>> --- a/drivers/pci/iov.c >>> +++ b/drivers/pci/iov.c >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >>> iov->pgsz = pgsz; >>> iov->self = dev; >>> + iov->autoprobe = true; >>> iov->drivers_autoprobe = true; >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >>> */ >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >>> { >>> - if (dev->is_physfn) >>> + if (dev->is_physfn) { >>> dev->sriov->drivers_autoprobe = auto_probe; >>> + if (!dev->sriov->num_VFs) >>> + dev->sriov->autoprobe = auto_probe; >> >> Why is dev->sriov->autoprobe set any time other than immediately prior >> to enabling VFs? > > My concern here was drivers that are still floating around with the > old module parameter option for enabling SR-IOV. In the unlikely event > that somebody was to use such a driver I wanted to make certain that > the drivers_autoprobe state was pre-populated. > >>> + } >>> } >>> >>> /** >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >>> >>> /** >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >>> + * @dev: the PCI device >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >>> + * >>> + * Used to provide generic enable/disable SR-IOV option for devices >>> + * that do not manage the VFs generated by their driver, or have no >>> + * driver present. >>> + */ >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >>> +{ >>> + int rc = 0; >>> + >>> + if (!nr_virtfn) >>> + pci_disable_sriov(dev); >>> + else >>> + rc = pci_enable_sriov(dev, nr_virtfn); >>> + >>> + return rc ? rc : nr_virtfn; >>> +} >>> + >>> +/** >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver >>> * @dev: the PCI device >>> * >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >>> index 3bed6beda051..2cc68dff6130 100644 >>> --- a/drivers/pci/pci-driver.c >>> +++ b/drivers/pci/pci-driver.c >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >>> #ifdef CONFIG_PCI_IOV >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>> { >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >>> } >>> #else >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >>> index eb6bee8724cc..e701b6dbc267 100644 >>> --- a/drivers/pci/pci-sysfs.c >>> +++ b/drivers/pci/pci-sysfs.c >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >>> struct device_attribute *attr, >>> const char *buf, size_t count) >>> { >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >>> struct pci_dev *pdev = to_pci_dev(dev); >>> int ret; >>> u16 num_vfs; >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >>> goto exit; >>> >>> /* is PF driver loaded w/callback */ >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >>> - ret = -ENOENT; >>> - goto exit; >>> - } >>> + if (pdev->driver && pdev->driver->sriov_configure) >>> + sriov_configure = pdev->driver->sriov_configure; >>> + else >>> + sriov_configure = pci_sriov_configure_unmanaged; >> >> So an unwitting user is now able to enable vfs, independent of the >> pf... the trouble being that they're probably going to expect them to >> work and the more common case is that they won't. For instance, what >> can you do with an igbvf when igb isn't managing the pf? > > Well the VFs wouldn't be able to do anything. Basically they would be > sitting there with no driver loaded on them unless they are assigned > to a guest, or the root user had enabled the unmanaged option. If you > did load a driver on it the VF would sit there with link down unless > either the PF driver is loaded or some user-space entity steps in to > start managing the PF. > > In reality this can already happen as last I recall igb and ixgbe were > already capable of having the PF driver removed when SR-IOV was > enabled and VFs were assigned. Basically the VFs just report link down > and don't do anything. Reloading the PF driver would have it take over > in the case of igb and ixgbe since they were designed to handle that > type of scenario. > >> Or what happens when vfio-pci owns the pf, sriov is enabled via the >> unmanaged interface, and the pf user driver segfaults and gets killed, >> causing vfio-pci to restore the pf state, including wiping the sriov >> config? > > Wiping the config shouldn't have any effect on the allocated VF pci > devices. It will cause the VFs to act as though they have fallen off > of the bus though and the guests would see a surprise remove type > behavior if I am not mistaken. The MMIO and config accesses would fail > until the SR-IOV configuration is restored. Really this shouldn't be a > problem as long as the SR-IOV is enabled prior to loading the vfio-pci > driver as I would assume it would restore the state an re-enable > SR-IOV. > > In the grand scheme of things how would the situation you describe be > any different than someone using the "reset" sysfs control on the PF > while using SR-IOV with driver supported SR-IOV? > > I suppose if you really wanted we could add a new call that you could > put into the sriov_configure pointer that would just make it always > return error. Then that way the configuration could be locked until > the driver is unloaded. > >> I guess I don't understand how vfs can operate fully independent of the >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to >> avoid contending with such issues. > > This approach isn't without risk, but it isn't as if it is really a > new risk we are introducing, and we could do further work to help > smooth out issues like this. Much of this depends on the design of the > device and the drivers involved. What is happening in this case is > that the VFs are managed outside of the kernel itself either by some > user-space entity or by a firmware entity. So I was thinking about this some more. In the case of vfio-pci things are a bit different since you are essentially taking a given device and handing it to a VM or userspace and it doesn't guarantee a communication between the two. My thought is to look at making SR-IOV configuration static or treat it as read-only when the vfio-pci driver is loaded on a given interface. In addition I would set the TAINT_USER flag and add a warning about loading vfio-pci on an active PF, and provide the number of VFs that were allocated. The idea with all of this being that we would at least have a partial lock on all of this so that you can allocate some number of VFs, and then start partitioning things up where you could assign the PF and VFs as needed to the various interfaces. Once the PF is assigned to the vfio-pci driver it would be locked in terms of the number of VFs that are provided so there would be no need for an extra communication channel between the PF driver and the host to deal with changes. Thoughts?
On 2/28/2018 9:49 AM, Alexander Duyck wrote: > On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >> <alex.williamson@redhat.com> wrote: >>> On Tue, 27 Feb 2018 11:06:54 -0800 >>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>> >>>> From: Alexander Duyck <alexander.h.duyck@intel.com> >>>> >>>> This patch is meant to add support for SR-IOV on devices when the VFs are >>>> not managed by the kernel. Examples of recent patches attempting to do this >>>> include: >>> It appears to enable sriov when the _pf_ is not managed by the >>> kernel, but by "managed" we mean that either there is no pf driver or >>> the pf driver doesn't provide an sriov_configure callback, >>> intentionally or otherwise. >>> >>>> virto - https://patchwork.kernel.org/patch/10241225/ >>>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >>>> vfio - https://patchwork.kernel.org/patch/10103353/ >>>> uio - https://patchwork.kernel.org/patch/9974031/ >>> So is the goal to get around the issues with enabling sriov on each of >>> the above drivers by doing it under the covers or are you really just >>> trying to enable sriov for a truly unmanage (no pf driver) case? For >>> example, should a driver explicitly not wanting sriov enabled implement >>> a dummy sriov_configure function? >>> >>>> Since this is quickly blowing up into a multi-driver problem it is probably >>>> best to implement this solution in one spot. >>>> >>>> This patch is an attempt to do that. What we do with this patch is provide >>>> a generic call to enable SR-IOV in the case that the PF driver is either >>>> not present, or the PF driver doesn't support configuring SR-IOV. >>>> >>>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >>>> value is used as the drivers_autoprobe setting of the VFs when they are >>>> being managed by an external entity such as userspace or device firmware >>>> instead of being managed by the kernel. >>> Documentation/ABI/testing/sysfs-bus-pci update is missing. >> I can make sure to update that in the next version. >> >>>> One side effect of this change is that the sriov_drivers_autoprobe and >>>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >>>> disabled. Attempts to update them when SR-IOV is in use will only update >>>> the local value and will not update sriov->autoprobe. >>> And we expect users to understand when sriov_drivers_autoprobe applies >>> vs sriov_unmanaged_autoprobe, even though they're using the same >>> interfaces to enable sriov? Are all combinations expected to work, ex. >>> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >>> only does it seems like there's opportunity to use this incorrectly, I >>> think maybe it might be difficult to use correctly. >>> >>>> I based my patch set originally on the patch by Mark Rustad but there isn't >>>> much left after going through and cleaning out the bits that were no longer >>>> needed, and after incorporating the feedback from David Miller. >>>> >>>> I have included the authors of the original 4 patches above in the Cc here. >>>> My hope is to get feedback and/or review on if this works for their use >>>> cases. >>>> >>>> Cc: Mark Rustad <mark.d.rustad@intel.com> >>>> Cc: Maximilian Heyne <mheyne@amazon.de> >>>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >>>> Cc: David Woodhouse <dwmw@amazon.co.uk> >>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>> --- >>>> drivers/pci/iov.c | 27 +++++++++++++++++++- >>>> drivers/pci/pci-driver.c | 2 + >>>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >>>> drivers/pci/pci.h | 4 ++- >>>> include/linux/pci.h | 1 + >>>> 5 files changed, 86 insertions(+), 10 deletions(-) >>>> >>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>>> index 677924ae0350..7b8858bd8d03 100644 >>>> --- a/drivers/pci/iov.c >>>> +++ b/drivers/pci/iov.c >>>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >>>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >>>> iov->pgsz = pgsz; >>>> iov->self = dev; >>>> + iov->autoprobe = true; >>>> iov->drivers_autoprobe = true; >>>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >>>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >>>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >>>> */ >>>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >>>> { >>>> - if (dev->is_physfn) >>>> + if (dev->is_physfn) { >>>> dev->sriov->drivers_autoprobe = auto_probe; >>>> + if (!dev->sriov->num_VFs) >>>> + dev->sriov->autoprobe = auto_probe; >>> Why is dev->sriov->autoprobe set any time other than immediately prior >>> to enabling VFs? >> My concern here was drivers that are still floating around with the >> old module parameter option for enabling SR-IOV. In the unlikely event >> that somebody was to use such a driver I wanted to make certain that >> the drivers_autoprobe state was pre-populated. >> >>>> + } >>>> } >>>> >>>> /** >>>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >>>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >>>> >>>> /** >>>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >>>> + * @dev: the PCI device >>>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >>>> + * >>>> + * Used to provide generic enable/disable SR-IOV option for devices >>>> + * that do not manage the VFs generated by their driver, or have no >>>> + * driver present. >>>> + */ >>>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >>>> +{ >>>> + int rc = 0; >>>> + >>>> + if (!nr_virtfn) >>>> + pci_disable_sriov(dev); >>>> + else >>>> + rc = pci_enable_sriov(dev, nr_virtfn); >>>> + >>>> + return rc ? rc : nr_virtfn; >>>> +} >>>> + >>>> +/** >>>> * pci_num_vf - return number of VFs associated with a PF device_release_driver >>>> * @dev: the PCI device >>>> * >>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >>>> index 3bed6beda051..2cc68dff6130 100644 >>>> --- a/drivers/pci/pci-driver.c >>>> +++ b/drivers/pci/pci-driver.c >>>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >>>> #ifdef CONFIG_PCI_IOV >>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>> { >>>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >>>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >>>> } >>>> #else >>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >>>> index eb6bee8724cc..e701b6dbc267 100644 >>>> --- a/drivers/pci/pci-sysfs.c >>>> +++ b/drivers/pci/pci-sysfs.c >>>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >>>> struct device_attribute *attr, >>>> const char *buf, size_t count) >>>> { >>>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >>>> struct pci_dev *pdev = to_pci_dev(dev); >>>> int ret; >>>> u16 num_vfs; >>>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >>>> goto exit; >>>> >>>> /* is PF driver loaded w/callback */ >>>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >>>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >>>> - ret = -ENOENT; >>>> - goto exit; >>>> - } >>>> + if (pdev->driver && pdev->driver->sriov_configure) >>>> + sriov_configure = pdev->driver->sriov_configure; >>>> + else >>>> + sriov_configure = pci_sriov_configure_unmanaged; >>> So an unwitting user is now able to enable vfs, independent of the >>> pf... the trouble being that they're probably going to expect them to >>> work and the more common case is that they won't. For instance, what >>> can you do with an igbvf when igb isn't managing the pf? >> Well the VFs wouldn't be able to do anything. Basically they would be >> sitting there with no driver loaded on them unless they are assigned >> to a guest, or the root user had enabled the unmanaged option. If you >> did load a driver on it the VF would sit there with link down unless >> either the PF driver is loaded or some user-space entity steps in to >> start managing the PF. >> >> In reality this can already happen as last I recall igb and ixgbe were >> already capable of having the PF driver removed when SR-IOV was >> enabled and VFs were assigned. Basically the VFs just report link down >> and don't do anything. Reloading the PF driver would have it take over >> in the case of igb and ixgbe since they were designed to handle that >> type of scenario. >> >>> Or what happens when vfio-pci owns the pf, sriov is enabled via the >>> unmanaged interface, and the pf user driver segfaults and gets killed, >>> causing vfio-pci to restore the pf state, including wiping the sriov >>> config? >> Wiping the config shouldn't have any effect on the allocated VF pci >> devices. It will cause the VFs to act as though they have fallen off >> of the bus though and the guests would see a surprise remove type >> behavior if I am not mistaken. The MMIO and config accesses would fail >> until the SR-IOV configuration is restored. Really this shouldn't be a >> problem as long as the SR-IOV is enabled prior to loading the vfio-pci >> driver as I would assume it would restore the state an re-enable >> SR-IOV. >> >> In the grand scheme of things how would the situation you describe be >> any different than someone using the "reset" sysfs control on the PF >> while using SR-IOV with driver supported SR-IOV? >> >> I suppose if you really wanted we could add a new call that you could >> put into the sriov_configure pointer that would just make it always >> return error. Then that way the configuration could be locked until >> the driver is unloaded. >> >>> I guess I don't understand how vfs can operate fully independent of the >>> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to >>> avoid contending with such issues. >> This approach isn't without risk, but it isn't as if it is really a >> new risk we are introducing, and we could do further work to help >> smooth out issues like this. Much of this depends on the design of the >> device and the drivers involved. What is happening in this case is >> that the VFs are managed outside of the kernel itself either by some >> user-space entity or by a firmware entity. > So I was thinking about this some more. In the case of vfio-pci things > are a bit different since you are essentially taking a given device > and handing it to a VM or userspace and it doesn't guarantee a > communication between the two. > > My thought is to look at making SR-IOV configuration static or treat > it as read-only when the vfio-pci driver is loaded on a given > interface. In addition I would set the TAINT_USER flag and add a > warning about loading vfio-pci on an active PF, and provide the number > of VFs that were allocated. > > The idea with all of this being that we would at least have a partial > lock on all of this so that you can allocate some number of VFs, and > then start partitioning things up where you could assign the PF and > VFs as needed to the various interfaces. Once the PF is assigned to > the vfio-pci driver it would be locked in terms of the number of VFs > that are provided so there would be no need for an extra communication > channel between the PF driver and the host to deal with changes. > > Thoughts? I've been following this conversation and I think putting in some restrictions in order to simplify the design is definitely worth considering. I think there's a lot of potential for this but KISS still applies. Thanks for the work on this! - Greg
On Wed, 28 Feb 2018 09:49:21 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: > > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson > > <alex.williamson@redhat.com> wrote: > >> On Tue, 27 Feb 2018 11:06:54 -0800 > >> Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> > >>> From: Alexander Duyck <alexander.h.duyck@intel.com> > >>> > >>> This patch is meant to add support for SR-IOV on devices when the VFs are > >>> not managed by the kernel. Examples of recent patches attempting to do this > >>> include: > >> > >> It appears to enable sriov when the _pf_ is not managed by the > >> kernel, but by "managed" we mean that either there is no pf driver or > >> the pf driver doesn't provide an sriov_configure callback, > >> intentionally or otherwise. > >> > >>> virto - https://patchwork.kernel.org/patch/10241225/ > >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ > >>> vfio - https://patchwork.kernel.org/patch/10103353/ > >>> uio - https://patchwork.kernel.org/patch/9974031/ > >> > >> So is the goal to get around the issues with enabling sriov on each of > >> the above drivers by doing it under the covers or are you really just > >> trying to enable sriov for a truly unmanage (no pf driver) case? For > >> example, should a driver explicitly not wanting sriov enabled implement > >> a dummy sriov_configure function? > >> > >>> Since this is quickly blowing up into a multi-driver problem it is probably > >>> best to implement this solution in one spot. > >>> > >>> This patch is an attempt to do that. What we do with this patch is provide > >>> a generic call to enable SR-IOV in the case that the PF driver is either > >>> not present, or the PF driver doesn't support configuring SR-IOV. > >>> > >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This > >>> value is used as the drivers_autoprobe setting of the VFs when they are > >>> being managed by an external entity such as userspace or device firmware > >>> instead of being managed by the kernel. > >> > >> Documentation/ABI/testing/sysfs-bus-pci update is missing. > > > > I can make sure to update that in the next version. > > > >>> One side effect of this change is that the sriov_drivers_autoprobe and > >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is > >>> disabled. Attempts to update them when SR-IOV is in use will only update > >>> the local value and will not update sriov->autoprobe. > >> > >> And we expect users to understand when sriov_drivers_autoprobe applies > >> vs sriov_unmanaged_autoprobe, even though they're using the same > >> interfaces to enable sriov? Are all combinations expected to work, ex. > >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not > >> only does it seems like there's opportunity to use this incorrectly, I > >> think maybe it might be difficult to use correctly. > >> > >>> I based my patch set originally on the patch by Mark Rustad but there isn't > >>> much left after going through and cleaning out the bits that were no longer > >>> needed, and after incorporating the feedback from David Miller. > >>> > >>> I have included the authors of the original 4 patches above in the Cc here. > >>> My hope is to get feedback and/or review on if this works for their use > >>> cases. > >>> > >>> Cc: Mark Rustad <mark.d.rustad@intel.com> > >>> Cc: Maximilian Heyne <mheyne@amazon.de> > >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> > >>> Cc: David Woodhouse <dwmw@amazon.co.uk> > >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > >>> --- > >>> drivers/pci/iov.c | 27 +++++++++++++++++++- > >>> drivers/pci/pci-driver.c | 2 + > >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- > >>> drivers/pci/pci.h | 4 ++- > >>> include/linux/pci.h | 1 + > >>> 5 files changed, 86 insertions(+), 10 deletions(-) > >>> > >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > >>> index 677924ae0350..7b8858bd8d03 100644 > >>> --- a/drivers/pci/iov.c > >>> +++ b/drivers/pci/iov.c > >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) > >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); > >>> iov->pgsz = pgsz; > >>> iov->self = dev; > >>> + iov->autoprobe = true; > >>> iov->drivers_autoprobe = true; > >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); > >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); > >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) > >>> */ > >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) > >>> { > >>> - if (dev->is_physfn) > >>> + if (dev->is_physfn) { > >>> dev->sriov->drivers_autoprobe = auto_probe; > >>> + if (!dev->sriov->num_VFs) > >>> + dev->sriov->autoprobe = auto_probe; > >> > >> Why is dev->sriov->autoprobe set any time other than immediately prior > >> to enabling VFs? > > > > My concern here was drivers that are still floating around with the > > old module parameter option for enabling SR-IOV. In the unlikely event > > that somebody was to use such a driver I wanted to make certain that > > the drivers_autoprobe state was pre-populated. Good point, but maybe that just means we should be setting it in sriov_enable()? > >>> + } > >>> } > >>> > >>> /** > >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) > >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); > >>> > >>> /** > >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV > >>> + * @dev: the PCI device > >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable > >>> + * > >>> + * Used to provide generic enable/disable SR-IOV option for devices > >>> + * that do not manage the VFs generated by their driver, or have no > >>> + * driver present. > >>> + */ > >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) > >>> +{ > >>> + int rc = 0; > >>> + > >>> + if (!nr_virtfn) > >>> + pci_disable_sriov(dev); > >>> + else > >>> + rc = pci_enable_sriov(dev, nr_virtfn); > >>> + > >>> + return rc ? rc : nr_virtfn; > >>> +} > >>> + > >>> +/** > >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver > >>> * @dev: the PCI device > >>> * > >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > >>> index 3bed6beda051..2cc68dff6130 100644 > >>> --- a/drivers/pci/pci-driver.c > >>> +++ b/drivers/pci/pci-driver.c > >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) > >>> #ifdef CONFIG_PCI_IOV > >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) > >>> { > >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); > >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); > >>> } > >>> #else > >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) > >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > >>> index eb6bee8724cc..e701b6dbc267 100644 > >>> --- a/drivers/pci/pci-sysfs.c > >>> +++ b/drivers/pci/pci-sysfs.c > >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, > >>> struct device_attribute *attr, > >>> const char *buf, size_t count) > >>> { > >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); > >>> struct pci_dev *pdev = to_pci_dev(dev); > >>> int ret; > >>> u16 num_vfs; > >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, > >>> goto exit; > >>> > >>> /* is PF driver loaded w/callback */ > >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { > >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); > >>> - ret = -ENOENT; > >>> - goto exit; > >>> - } > >>> + if (pdev->driver && pdev->driver->sriov_configure) > >>> + sriov_configure = pdev->driver->sriov_configure; > >>> + else > >>> + sriov_configure = pci_sriov_configure_unmanaged; > >> > >> So an unwitting user is now able to enable vfs, independent of the > >> pf... the trouble being that they're probably going to expect them to > >> work and the more common case is that they won't. For instance, what > >> can you do with an igbvf when igb isn't managing the pf? > > > > Well the VFs wouldn't be able to do anything. Basically they would be > > sitting there with no driver loaded on them unless they are assigned > > to a guest, or the root user had enabled the unmanaged option. If you > > did load a driver on it the VF would sit there with link down unless > > either the PF driver is loaded or some user-space entity steps in to > > start managing the PF. > > > > In reality this can already happen as last I recall igb and ixgbe were > > already capable of having the PF driver removed when SR-IOV was > > enabled and VFs were assigned. Basically the VFs just report link down > > and don't do anything. Reloading the PF driver would have it take over > > in the case of igb and ixgbe since they were designed to handle that > > type of scenario. I think only ixgbe behaves this way between the two, igb disables sriov in its remove function unless we're hung up by that silly pci_vfs_assigned() check, which doesn't apply to vfio assignment. Regardless, yes, some pf drivers do leave sriov enabled on remove, whether that's useful or reasonable is another question. > >> Or what happens when vfio-pci owns the pf, sriov is enabled via the > >> unmanaged interface, and the pf user driver segfaults and gets killed, > >> causing vfio-pci to restore the pf state, including wiping the sriov > >> config? > > > > Wiping the config shouldn't have any effect on the allocated VF pci > > devices. It will cause the VFs to act as though they have fallen off > > of the bus though and the guests would see a surprise remove type > > behavior if I am not mistaken. The MMIO and config accesses would fail > > until the SR-IOV configuration is restored. Really this shouldn't be a > > problem as long as the SR-IOV is enabled prior to loading the vfio-pci > > driver as I would assume it would restore the state an re-enable > > SR-IOV. Nope, sriov does not appear to be part of the device state saved and restored around reset, so I think we just end up in an inconsistent state. Also, vfio-pci stores a copy of the saved state of a device prior to giving the user access to the device and tries to restore that copy when released by the user, so anything that's part of the save state and modified via side-channels while the device is opened, would still be lost. > > In the grand scheme of things how would the situation you describe be > > any different than someone using the "reset" sysfs control on the PF > > while using SR-IOV with driver supported SR-IOV? Similar, but the pf driver has enabled sriov and can register the reset_done error handler to be notified to re-enable sriov or perform other cleanup. If the pf driver is not participating in sriov, it would seem exceptional to support this. Triggering reset via sysfs also falls more closely to shooting yourself in the foot than I think we want to design into driver/subsystem APIs. > > I suppose if you really wanted we could add a new call that you could > > put into the sriov_configure pointer that would just make it always > > return error. Then that way the configuration could be locked until > > the driver is unloaded. > > > >> I guess I don't understand how vfs can operate fully independent of the > >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to > >> avoid contending with such issues. > > > > This approach isn't without risk, but it isn't as if it is really a > > new risk we are introducing, and we could do further work to help > > smooth out issues like this. Much of this depends on the design of the > > device and the drivers involved. What is happening in this case is > > that the VFs are managed outside of the kernel itself either by some > > user-space entity or by a firmware entity. But we can't simply split vfs and pfs as separate resources, the vf is dependent on the pf and that implies a privilege hierarchy, if not at least cooperation. The pf driver can intentionally or unintentionally disconnect the vfs at any point in time, possibly leading to data loss or at least denial of service for the vf consumer. I also don't trust that there aren't numerous sriov implementations where the pf isn't privileged to the point of being able to snoop data from the vf. So what sort of usage models are we enabling for the vf? A firmware owned vf subject to these conditions seems mostly pointless. > So I was thinking about this some more. In the case of vfio-pci things > are a bit different since you are essentially taking a given device > and handing it to a VM or userspace and it doesn't guarantee a > communication between the two. Doesn't guarantee communication or cooperation or even trust. > My thought is to look at making SR-IOV configuration static or treat > it as read-only when the vfio-pci driver is loaded on a given > interface. In addition I would set the TAINT_USER flag and add a > warning about loading vfio-pci on an active PF, and provide the number > of VFs that were allocated. > > The idea with all of this being that we would at least have a partial > lock on all of this so that you can allocate some number of VFs, and > then start partitioning things up where you could assign the PF and > VFs as needed to the various interfaces. Once the PF is assigned to > the vfio-pci driver it would be locked in terms of the number of VFs > that are provided so there would be no need for an extra communication > channel between the PF driver and the host to deal with changes. > > Thoughts? Is an sriov configuration ever static? vfio is the most prolific user of the pci reset interfaces, AFAIK. Even if we add support for restoring the sriov configuration and even if the pf user isn't trying to be malicious, vf users would need to be prepared for their device arbitrarily dropping off the bus, mmio randomly (from their perspective) being disabled, and maybe other side-effects of the user just trying to use the pf device. Is there even any guarantee that a pf driver can operate identically, regardless of the state of sriov on the pf? It seems like the pf driver needs to be aware of which queues are available for itself vs the vfs in some designs. Is there still long term value in a solution for this if the kernel is tainted? Minimally, it seems like the pf driver (the user driver, not vfio-pci) needs to be a cooperating party in this exchange. How do we determine that and what QoS guarantees are they providing by agreeing to it? Otherwise it doesn't seem like sriov provides the sort of static, hard partitioning of the device that you're asking for. A vf is a piece of the pf and we've released ownership and control of the pf to a user. Thanks, Alex
On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 28 Feb 2018 09:49:21 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >> <alexander.duyck@gmail.com> wrote: >> > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >> > <alex.williamson@redhat.com> wrote: >> >> On Tue, 27 Feb 2018 11:06:54 -0800 >> >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >> >> >>> From: Alexander Duyck <alexander.h.duyck@intel.com> >> >>> >> >>> This patch is meant to add support for SR-IOV on devices when the VFs are >> >>> not managed by the kernel. Examples of recent patches attempting to do this >> >>> include: >> >> >> >> It appears to enable sriov when the _pf_ is not managed by the >> >> kernel, but by "managed" we mean that either there is no pf driver or >> >> the pf driver doesn't provide an sriov_configure callback, >> >> intentionally or otherwise. >> >> >> >>> virto - https://patchwork.kernel.org/patch/10241225/ >> >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >> >>> vfio - https://patchwork.kernel.org/patch/10103353/ >> >>> uio - https://patchwork.kernel.org/patch/9974031/ >> >> >> >> So is the goal to get around the issues with enabling sriov on each of >> >> the above drivers by doing it under the covers or are you really just >> >> trying to enable sriov for a truly unmanage (no pf driver) case? For >> >> example, should a driver explicitly not wanting sriov enabled implement >> >> a dummy sriov_configure function? >> >> >> >>> Since this is quickly blowing up into a multi-driver problem it is probably >> >>> best to implement this solution in one spot. >> >>> >> >>> This patch is an attempt to do that. What we do with this patch is provide >> >>> a generic call to enable SR-IOV in the case that the PF driver is either >> >>> not present, or the PF driver doesn't support configuring SR-IOV. >> >>> >> >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >> >>> value is used as the drivers_autoprobe setting of the VFs when they are >> >>> being managed by an external entity such as userspace or device firmware >> >>> instead of being managed by the kernel. >> >> >> >> Documentation/ABI/testing/sysfs-bus-pci update is missing. >> > >> > I can make sure to update that in the next version. >> > >> >>> One side effect of this change is that the sriov_drivers_autoprobe and >> >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >> >>> disabled. Attempts to update them when SR-IOV is in use will only update >> >>> the local value and will not update sriov->autoprobe. >> >> >> >> And we expect users to understand when sriov_drivers_autoprobe applies >> >> vs sriov_unmanaged_autoprobe, even though they're using the same >> >> interfaces to enable sriov? Are all combinations expected to work, ex. >> >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >> >> only does it seems like there's opportunity to use this incorrectly, I >> >> think maybe it might be difficult to use correctly. >> >> >> >>> I based my patch set originally on the patch by Mark Rustad but there isn't >> >>> much left after going through and cleaning out the bits that were no longer >> >>> needed, and after incorporating the feedback from David Miller. >> >>> >> >>> I have included the authors of the original 4 patches above in the Cc here. >> >>> My hope is to get feedback and/or review on if this works for their use >> >>> cases. >> >>> >> >>> Cc: Mark Rustad <mark.d.rustad@intel.com> >> >>> Cc: Maximilian Heyne <mheyne@amazon.de> >> >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >> >>> Cc: David Woodhouse <dwmw@amazon.co.uk> >> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> >>> --- >> >>> drivers/pci/iov.c | 27 +++++++++++++++++++- >> >>> drivers/pci/pci-driver.c | 2 + >> >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >> >>> drivers/pci/pci.h | 4 ++- >> >>> include/linux/pci.h | 1 + >> >>> 5 files changed, 86 insertions(+), 10 deletions(-) >> >>> >> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >> >>> index 677924ae0350..7b8858bd8d03 100644 >> >>> --- a/drivers/pci/iov.c >> >>> +++ b/drivers/pci/iov.c >> >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >> >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >> >>> iov->pgsz = pgsz; >> >>> iov->self = dev; >> >>> + iov->autoprobe = true; >> >>> iov->drivers_autoprobe = true; >> >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >> >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >> >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >> >>> */ >> >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >> >>> { >> >>> - if (dev->is_physfn) >> >>> + if (dev->is_physfn) { >> >>> dev->sriov->drivers_autoprobe = auto_probe; >> >>> + if (!dev->sriov->num_VFs) >> >>> + dev->sriov->autoprobe = auto_probe; >> >> >> >> Why is dev->sriov->autoprobe set any time other than immediately prior >> >> to enabling VFs? >> > >> > My concern here was drivers that are still floating around with the >> > old module parameter option for enabling SR-IOV. In the unlikely event >> > that somebody was to use such a driver I wanted to make certain that >> > the drivers_autoprobe state was pre-populated. > > Good point, but maybe that just means we should be setting it in > sriov_enable()? I suppose we could. I would just have to check and see if we have any drivers lurking out there that are supporting SR-IOV without supporting the sysfs approach. As long as that is the case we could probably put it there. >> >>> + } >> >>> } >> >>> >> >>> /** >> >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >> >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >> >>> >> >>> /** >> >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >> >>> + * @dev: the PCI device >> >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >> >>> + * >> >>> + * Used to provide generic enable/disable SR-IOV option for devices >> >>> + * that do not manage the VFs generated by their driver, or have no >> >>> + * driver present. >> >>> + */ >> >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >> >>> +{ >> >>> + int rc = 0; >> >>> + >> >>> + if (!nr_virtfn) >> >>> + pci_disable_sriov(dev); >> >>> + else >> >>> + rc = pci_enable_sriov(dev, nr_virtfn); >> >>> + >> >>> + return rc ? rc : nr_virtfn; >> >>> +} >> >>> + >> >>> +/** >> >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver >> >>> * @dev: the PCI device >> >>> * >> >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >> >>> index 3bed6beda051..2cc68dff6130 100644 >> >>> --- a/drivers/pci/pci-driver.c >> >>> +++ b/drivers/pci/pci-driver.c >> >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >> >>> #ifdef CONFIG_PCI_IOV >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> >>> { >> >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >> >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >> >>> } >> >>> #else >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >> >>> index eb6bee8724cc..e701b6dbc267 100644 >> >>> --- a/drivers/pci/pci-sysfs.c >> >>> +++ b/drivers/pci/pci-sysfs.c >> >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> >>> struct device_attribute *attr, >> >>> const char *buf, size_t count) >> >>> { >> >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >> >>> struct pci_dev *pdev = to_pci_dev(dev); >> >>> int ret; >> >>> u16 num_vfs; >> >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> >>> goto exit; >> >>> >> >>> /* is PF driver loaded w/callback */ >> >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >> >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >> >>> - ret = -ENOENT; >> >>> - goto exit; >> >>> - } >> >>> + if (pdev->driver && pdev->driver->sriov_configure) >> >>> + sriov_configure = pdev->driver->sriov_configure; >> >>> + else >> >>> + sriov_configure = pci_sriov_configure_unmanaged; >> >> >> >> So an unwitting user is now able to enable vfs, independent of the >> >> pf... the trouble being that they're probably going to expect them to >> >> work and the more common case is that they won't. For instance, what >> >> can you do with an igbvf when igb isn't managing the pf? >> > >> > Well the VFs wouldn't be able to do anything. Basically they would be >> > sitting there with no driver loaded on them unless they are assigned >> > to a guest, or the root user had enabled the unmanaged option. If you >> > did load a driver on it the VF would sit there with link down unless >> > either the PF driver is loaded or some user-space entity steps in to >> > start managing the PF. >> > >> > In reality this can already happen as last I recall igb and ixgbe were >> > already capable of having the PF driver removed when SR-IOV was >> > enabled and VFs were assigned. Basically the VFs just report link down >> > and don't do anything. Reloading the PF driver would have it take over >> > in the case of igb and ixgbe since they were designed to handle that >> > type of scenario. > > I think only ixgbe behaves this way between the two, igb disables sriov > in its remove function unless we're hung up by that silly > pci_vfs_assigned() check, which doesn't apply to vfio assignment. > Regardless, yes, some pf drivers do leave sriov enabled on remove, > whether that's useful or reasonable is another question. Right. We can argue that another day. I was just sighting the igb behavior. >> >> Or what happens when vfio-pci owns the pf, sriov is enabled via the >> >> unmanaged interface, and the pf user driver segfaults and gets killed, >> >> causing vfio-pci to restore the pf state, including wiping the sriov >> >> config? >> > >> > Wiping the config shouldn't have any effect on the allocated VF pci >> > devices. It will cause the VFs to act as though they have fallen off >> > of the bus though and the guests would see a surprise remove type >> > behavior if I am not mistaken. The MMIO and config accesses would fail >> > until the SR-IOV configuration is restored. Really this shouldn't be a >> > problem as long as the SR-IOV is enabled prior to loading the vfio-pci >> > driver as I would assume it would restore the state an re-enable >> > SR-IOV. > > Nope, sriov does not appear to be part of the device state saved and > restored around reset, so I think we just end up in an inconsistent > state. Also, vfio-pci stores a copy of the saved state of a device > prior to giving the user access to the device and tries to restore that > copy when released by the user, so anything that's part of the save > state and modified via side-channels while the device is opened, would > still be lost. Well it is but it isn't. The pci_restore_state will call pci_restore_iov_state which will just repopulate the data straight out of the pdev->sriov structure. I think the assumption is we were already carrying enough information that there wasn't any point in saving the state since we could just restore it from that structure anyway. I've been chasing it down and can verify that much this morning after testing. A regular hit of the sysfs reset control will not erase the state. One issue that I think I found that may be a bug though is if I load the vfio-pci driver on an interface and unbind it I am seeing the port state reset. I have it chased down to the idle D3 code. It looks like going from vfio_pci_probe to vfio_pci_remove is triggering the equivilent of the pci_pm_reset since it is cycling me through D0->D3->D0 without restoring state after the fact. I also verified that setting disable_idle_d3 resolves the issue. Would you have any complaints about me doing a save_state in the probe call, and a restore_state in the remove? It seems like that should probably be the easiest fix. >> > In the grand scheme of things how would the situation you describe be >> > any different than someone using the "reset" sysfs control on the PF >> > while using SR-IOV with driver supported SR-IOV? > > Similar, but the pf driver has enabled sriov and can register the > reset_done error handler to be notified to re-enable sriov or perform > other cleanup. If the pf driver is not participating in sriov, it > would seem exceptional to support this. Triggering reset via sysfs > also falls more closely to shooting yourself in the foot than I think > we want to design into driver/subsystem APIs. To some extent I agree with the shooting yourself in the foot. At the same time there isn't actually all that much to re-enable SR-IOV. Restoring the PCI SR-IOV configuration space and re-enabling the bus master enable. The Amazon guys would probably know better than I since I haven't really worked much with one of these parts yet. Actually the virtio that Mark pushed may behave the same way too. As far as I know in these firmware cases the hardware itself has everything pre-partitioned and set to re-enable as soon as the SR-IOV bits are set. I think all they need is a few bits toggled and they are pretty much good to go. >> > I suppose if you really wanted we could add a new call that you could >> > put into the sriov_configure pointer that would just make it always >> > return error. Then that way the configuration could be locked until >> > the driver is unloaded. >> > >> >> I guess I don't understand how vfs can operate fully independent of the >> >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to >> >> avoid contending with such issues. >> > >> > This approach isn't without risk, but it isn't as if it is really a >> > new risk we are introducing, and we could do further work to help >> > smooth out issues like this. Much of this depends on the design of the >> > device and the drivers involved. What is happening in this case is >> > that the VFs are managed outside of the kernel itself either by some >> > user-space entity or by a firmware entity. > > But we can't simply split vfs and pfs as separate resources, the vf is > dependent on the pf and that implies a privilege hierarchy, if not at > least cooperation. The pf driver can intentionally or unintentionally > disconnect the vfs at any point in time, possibly leading to data loss > or at least denial of service for the vf consumer. I also don't trust > that there aren't numerous sriov implementations where the pf isn't > privileged to the point of being able to snoop data from the vf. So > what sort of usage models are we enabling for the vf? A firmware owned > vf subject to these conditions seems mostly pointless. In the case of these sort of parts the PF isn't really all that privileged. Usually the PF is just a VF with the SR-IOV capability hanging off of it. I suspect the only thing that might outright control anything would be the bus master enable bit. Everything else could be independent. The PF driver in such cases doesn't do much. It is basically just the host for the configuration space. >> So I was thinking about this some more. In the case of vfio-pci things >> are a bit different since you are essentially taking a given device >> and handing it to a VM or userspace and it doesn't guarantee a >> communication between the two. > > Doesn't guarantee communication or cooperation or even trust. Right, but at the same time I consider this to be a shoot yourself in the foot type scenario. If you are going to hand off your PF while VFs are active then you are asking for whatever repercussions you end up getting. I've added a warning and a TAINT_USER flag to my code at this point if you load an active PF in vfio, and I have added a function that locks the setting so it cannot be changed once we place a PF in the control of vfio-pci. The way I see it there are two scenarios. One where the PF is just a VF with an extra bit of SR-IOV configuration space floating off of it. The other is where we want to have SR-IOV enabled and have some third party managing the PF. The first one doesn't really care about the communication and would prefer that whatever owns the driver on the PF just ignore the SR-IOV portion of the configuration space. The second one actually cares and would want some sort of communication/cooperation but for now I don't see that as being the easiest thing to do so it might be best to just let it see a fixed number of VFs it just has to work with that. >> My thought is to look at making SR-IOV configuration static or treat >> it as read-only when the vfio-pci driver is loaded on a given >> interface. In addition I would set the TAINT_USER flag and add a >> warning about loading vfio-pci on an active PF, and provide the number >> of VFs that were allocated. >> >> The idea with all of this being that we would at least have a partial >> lock on all of this so that you can allocate some number of VFs, and >> then start partitioning things up where you could assign the PF and >> VFs as needed to the various interfaces. Once the PF is assigned to >> the vfio-pci driver it would be locked in terms of the number of VFs >> that are provided so there would be no need for an extra communication >> channel between the PF driver and the host to deal with changes. >> >> Thoughts? > > Is an sriov configuration ever static? vfio is the most prolific user > of the pci reset interfaces, AFAIK. Even if we add support for > restoring the sriov configuration and even if the pf user isn't trying > to be malicious, vf users would need to be prepared for their device > arbitrarily dropping off the bus, mmio randomly (from their > perspective) being disabled, and maybe other side-effects of the user > just trying to use the pf device. Is there even any guarantee that a > pf driver can operate identically, regardless of the state of sriov on > the pf? It seems like the pf driver needs to be aware of which queues > are available for itself vs the vfs in some designs. Is there still > long term value in a solution for this if the kernel is tainted? As far as the device randomly resetting I don't really see how that is any different from what we already have in terms of solutions including stuff supported by in-driver SR-IOV. The Intel drivers are notorious for resetting for any little thing like an MTU change. Us resetting due to someone calling open/release on an interface wouldn't be anything new. In addition AER can always hit us too. Yes we don't have the same recovery mechanisms in place, but in the case of something such as this we probably don't need that complex of a recovery setup. I would think there probably is. If not we wouldn't have gotten the patches earlier that were still doing the tainting and warning and making use of the vfio-pci driver to enable SR-IOV. I suspect the use case for all this is to enable SR-IOV, setup the PF in vfio-pci, and then assign VFs into and out of guests. I don't see the PF doing a lot of moving after it is setup. The taint flag only really applies if someone is looking for support and quite honestly I figure for now the USER flag is appropriate for this kind of thing since we are deferring all the networking control to the PF which is managed by userspace. > Minimally, it seems like the pf driver (the user driver, not vfio-pci) > needs to be a cooperating party in this exchange. How do we determine > that and what QoS guarantees are they providing by agreeing to it? I think the general idea here is that the user has already tested this out and determined for themselves that the user driver does what they want/need. If it didn't they wouldn't be going to all these lengths to set this up. > Otherwise it doesn't seem like sriov provides the sort of static, hard > partitioning of the device that you're asking for. A vf is a piece of > the pf and we've released ownership and control of the pf to a user. > Thanks, > > Alex I am pretty sure that you are describing is true of some, but not for all. I think the Amazon solutions and the virtio solution are doing hard partitioning of the part. I will leave it to those guys to speak for themselves since I don't know anything about the hardware design of those parts.
On Wed, 28 Feb 2018 16:36:38 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson > <alex.williamson@redhat.com> wrote: > > On Wed, 28 Feb 2018 09:49:21 -0800 > > Alexander Duyck <alexander.duyck@gmail.com> wrote: > > > >> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck > >> <alexander.duyck@gmail.com> wrote: > >> > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson > >> > <alex.williamson@redhat.com> wrote: > >> >> On Tue, 27 Feb 2018 11:06:54 -0800 > >> >> Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> >> > >> >>> From: Alexander Duyck <alexander.h.duyck@intel.com> > >> >>> > >> >>> This patch is meant to add support for SR-IOV on devices when the VFs are > >> >>> not managed by the kernel. Examples of recent patches attempting to do this > >> >>> include: > >> >> > >> >> It appears to enable sriov when the _pf_ is not managed by the > >> >> kernel, but by "managed" we mean that either there is no pf driver or > >> >> the pf driver doesn't provide an sriov_configure callback, > >> >> intentionally or otherwise. > >> >> > >> >>> virto - https://patchwork.kernel.org/patch/10241225/ > >> >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ > >> >>> vfio - https://patchwork.kernel.org/patch/10103353/ > >> >>> uio - https://patchwork.kernel.org/patch/9974031/ > >> >> > >> >> So is the goal to get around the issues with enabling sriov on each of > >> >> the above drivers by doing it under the covers or are you really just > >> >> trying to enable sriov for a truly unmanage (no pf driver) case? For > >> >> example, should a driver explicitly not wanting sriov enabled implement > >> >> a dummy sriov_configure function? > >> >> > >> >>> Since this is quickly blowing up into a multi-driver problem it is probably > >> >>> best to implement this solution in one spot. > >> >>> > >> >>> This patch is an attempt to do that. What we do with this patch is provide > >> >>> a generic call to enable SR-IOV in the case that the PF driver is either > >> >>> not present, or the PF driver doesn't support configuring SR-IOV. > >> >>> > >> >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This > >> >>> value is used as the drivers_autoprobe setting of the VFs when they are > >> >>> being managed by an external entity such as userspace or device firmware > >> >>> instead of being managed by the kernel. > >> >> > >> >> Documentation/ABI/testing/sysfs-bus-pci update is missing. > >> > > >> > I can make sure to update that in the next version. > >> > > >> >>> One side effect of this change is that the sriov_drivers_autoprobe and > >> >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is > >> >>> disabled. Attempts to update them when SR-IOV is in use will only update > >> >>> the local value and will not update sriov->autoprobe. > >> >> > >> >> And we expect users to understand when sriov_drivers_autoprobe applies > >> >> vs sriov_unmanaged_autoprobe, even though they're using the same > >> >> interfaces to enable sriov? Are all combinations expected to work, ex. > >> >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not > >> >> only does it seems like there's opportunity to use this incorrectly, I > >> >> think maybe it might be difficult to use correctly. > >> >> > >> >>> I based my patch set originally on the patch by Mark Rustad but there isn't > >> >>> much left after going through and cleaning out the bits that were no longer > >> >>> needed, and after incorporating the feedback from David Miller. > >> >>> > >> >>> I have included the authors of the original 4 patches above in the Cc here. > >> >>> My hope is to get feedback and/or review on if this works for their use > >> >>> cases. > >> >>> > >> >>> Cc: Mark Rustad <mark.d.rustad@intel.com> > >> >>> Cc: Maximilian Heyne <mheyne@amazon.de> > >> >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> > >> >>> Cc: David Woodhouse <dwmw@amazon.co.uk> > >> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > >> >>> --- > >> >>> drivers/pci/iov.c | 27 +++++++++++++++++++- > >> >>> drivers/pci/pci-driver.c | 2 + > >> >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- > >> >>> drivers/pci/pci.h | 4 ++- > >> >>> include/linux/pci.h | 1 + > >> >>> 5 files changed, 86 insertions(+), 10 deletions(-) > >> >>> > >> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > >> >>> index 677924ae0350..7b8858bd8d03 100644 > >> >>> --- a/drivers/pci/iov.c > >> >>> +++ b/drivers/pci/iov.c > >> >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) > >> >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); > >> >>> iov->pgsz = pgsz; > >> >>> iov->self = dev; > >> >>> + iov->autoprobe = true; > >> >>> iov->drivers_autoprobe = true; > >> >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); > >> >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); > >> >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) > >> >>> */ > >> >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) > >> >>> { > >> >>> - if (dev->is_physfn) > >> >>> + if (dev->is_physfn) { > >> >>> dev->sriov->drivers_autoprobe = auto_probe; > >> >>> + if (!dev->sriov->num_VFs) > >> >>> + dev->sriov->autoprobe = auto_probe; > >> >> > >> >> Why is dev->sriov->autoprobe set any time other than immediately prior > >> >> to enabling VFs? > >> > > >> > My concern here was drivers that are still floating around with the > >> > old module parameter option for enabling SR-IOV. In the unlikely event > >> > that somebody was to use such a driver I wanted to make certain that > >> > the drivers_autoprobe state was pre-populated. > > > > Good point, but maybe that just means we should be setting it in > > sriov_enable()? > > I suppose we could. I would just have to check and see if we have any > drivers lurking out there that are supporting SR-IOV without > supporting the sysfs approach. As long as that is the case we could > probably put it there. > > >> >>> + } > >> >>> } > >> >>> > >> >>> /** > >> >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) > >> >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); > >> >>> > >> >>> /** > >> >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV > >> >>> + * @dev: the PCI device > >> >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable > >> >>> + * > >> >>> + * Used to provide generic enable/disable SR-IOV option for devices > >> >>> + * that do not manage the VFs generated by their driver, or have no > >> >>> + * driver present. > >> >>> + */ > >> >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) > >> >>> +{ > >> >>> + int rc = 0; > >> >>> + > >> >>> + if (!nr_virtfn) > >> >>> + pci_disable_sriov(dev); > >> >>> + else > >> >>> + rc = pci_enable_sriov(dev, nr_virtfn); > >> >>> + > >> >>> + return rc ? rc : nr_virtfn; > >> >>> +} > >> >>> + > >> >>> +/** > >> >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver > >> >>> * @dev: the PCI device > >> >>> * > >> >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > >> >>> index 3bed6beda051..2cc68dff6130 100644 > >> >>> --- a/drivers/pci/pci-driver.c > >> >>> +++ b/drivers/pci/pci-driver.c > >> >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) > >> >>> #ifdef CONFIG_PCI_IOV > >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) > >> >>> { > >> >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); > >> >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); > >> >>> } > >> >>> #else > >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) > >> >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > >> >>> index eb6bee8724cc..e701b6dbc267 100644 > >> >>> --- a/drivers/pci/pci-sysfs.c > >> >>> +++ b/drivers/pci/pci-sysfs.c > >> >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, > >> >>> struct device_attribute *attr, > >> >>> const char *buf, size_t count) > >> >>> { > >> >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); > >> >>> struct pci_dev *pdev = to_pci_dev(dev); > >> >>> int ret; > >> >>> u16 num_vfs; > >> >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, > >> >>> goto exit; > >> >>> > >> >>> /* is PF driver loaded w/callback */ > >> >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { > >> >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); > >> >>> - ret = -ENOENT; > >> >>> - goto exit; > >> >>> - } > >> >>> + if (pdev->driver && pdev->driver->sriov_configure) > >> >>> + sriov_configure = pdev->driver->sriov_configure; > >> >>> + else > >> >>> + sriov_configure = pci_sriov_configure_unmanaged; > >> >> > >> >> So an unwitting user is now able to enable vfs, independent of the > >> >> pf... the trouble being that they're probably going to expect them to > >> >> work and the more common case is that they won't. For instance, what > >> >> can you do with an igbvf when igb isn't managing the pf? > >> > > >> > Well the VFs wouldn't be able to do anything. Basically they would be > >> > sitting there with no driver loaded on them unless they are assigned > >> > to a guest, or the root user had enabled the unmanaged option. If you > >> > did load a driver on it the VF would sit there with link down unless > >> > either the PF driver is loaded or some user-space entity steps in to > >> > start managing the PF. > >> > > >> > In reality this can already happen as last I recall igb and ixgbe were > >> > already capable of having the PF driver removed when SR-IOV was > >> > enabled and VFs were assigned. Basically the VFs just report link down > >> > and don't do anything. Reloading the PF driver would have it take over > >> > in the case of igb and ixgbe since they were designed to handle that > >> > type of scenario. > > > > I think only ixgbe behaves this way between the two, igb disables sriov > > in its remove function unless we're hung up by that silly > > pci_vfs_assigned() check, which doesn't apply to vfio assignment. > > Regardless, yes, some pf drivers do leave sriov enabled on remove, > > whether that's useful or reasonable is another question. > > Right. We can argue that another day. I was just sighting the igb behavior. > > >> >> Or what happens when vfio-pci owns the pf, sriov is enabled via the > >> >> unmanaged interface, and the pf user driver segfaults and gets killed, > >> >> causing vfio-pci to restore the pf state, including wiping the sriov > >> >> config? > >> > > >> > Wiping the config shouldn't have any effect on the allocated VF pci > >> > devices. It will cause the VFs to act as though they have fallen off > >> > of the bus though and the guests would see a surprise remove type > >> > behavior if I am not mistaken. The MMIO and config accesses would fail > >> > until the SR-IOV configuration is restored. Really this shouldn't be a > >> > problem as long as the SR-IOV is enabled prior to loading the vfio-pci > >> > driver as I would assume it would restore the state an re-enable > >> > SR-IOV. > > > > Nope, sriov does not appear to be part of the device state saved and > > restored around reset, so I think we just end up in an inconsistent > > state. Also, vfio-pci stores a copy of the saved state of a device > > prior to giving the user access to the device and tries to restore that > > copy when released by the user, so anything that's part of the save > > state and modified via side-channels while the device is opened, would > > still be lost. > > Well it is but it isn't. The pci_restore_state will call > pci_restore_iov_state which will just repopulate the data straight out > of the pdev->sriov structure. I think the assumption is we were > already carrying enough information that there wasn't any point in > saving the state since we could just restore it from that structure > anyway. Ah, right, I remember that now. > I've been chasing it down and can verify that much this morning after > testing. A regular hit of the sysfs reset control will not erase the > state. Perhaps long term, but momentarily the vf disappears and that could imply data loss, no? > One issue that I think I found that may be a bug though is if I > load the vfio-pci driver on an interface and unbind it I am seeing the > port state reset. I have it chased down to the idle D3 code. It looks > like going from vfio_pci_probe to vfio_pci_remove is triggering the > equivilent of the pci_pm_reset since it is cycling me through > D0->D3->D0 without restoring state after the fact. I also verified > that setting disable_idle_d3 resolves the issue. Would you have any > complaints about me doing a save_state in the probe call, and a > restore_state in the remove? It seems like that should probably be the > easiest fix. Hmm, perhaps I had assumed that pci_set_power_state() would handle such things, but indeed I do see other drivers calling pci_save_state() prior and pci_restore_state() after. I think we'd need to audit all of vfio-pci's calls to pci_set_power_state(), not just probe and remove. > >> > In the grand scheme of things how would the situation you describe be > >> > any different than someone using the "reset" sysfs control on the PF > >> > while using SR-IOV with driver supported SR-IOV? > > > > Similar, but the pf driver has enabled sriov and can register the > > reset_done error handler to be notified to re-enable sriov or perform > > other cleanup. If the pf driver is not participating in sriov, it > > would seem exceptional to support this. Triggering reset via sysfs > > also falls more closely to shooting yourself in the foot than I think > > we want to design into driver/subsystem APIs. > > To some extent I agree with the shooting yourself in the foot. At the > same time there isn't actually all that much to re-enable SR-IOV. > Restoring the PCI SR-IOV configuration space and re-enabling the bus > master enable. Bus master is a pretty interesting example of the trivial control a pf driver can exert on the vfs though. > The Amazon guys would probably know better than I since I haven't > really worked much with one of these parts yet. Actually the virtio > that Mark pushed may behave the same way too. As far as I know in > these firmware cases the hardware itself has everything > pre-partitioned and set to re-enable as soon as the SR-IOV bits are > set. I think all they need is a few bits toggled and they are pretty > much good to go. Are they just looking for an sriov capable stub driver? With increasing vf counts, being able to use something like vfio-pci on the pf seems like all risk with statistically insignificant increase in density. On the other hand, if there's a userspace pf management driver, why not just make it trusted by adding it as a native host kernel driver? If we're talking about tainting the host kernel to enable this interaction, maybe it should just be tainted by an out of tree, possibly non-gpl host pf driver anyway. There can't really be a case where the pf would be used by an average user without some degree of privilege or cooperation, right? > >> > I suppose if you really wanted we could add a new call that you could > >> > put into the sriov_configure pointer that would just make it always > >> > return error. Then that way the configuration could be locked until > >> > the driver is unloaded. > >> > > >> >> I guess I don't understand how vfs can operate fully independent of the > >> >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to > >> >> avoid contending with such issues. > >> > > >> > This approach isn't without risk, but it isn't as if it is really a > >> > new risk we are introducing, and we could do further work to help > >> > smooth out issues like this. Much of this depends on the design of the > >> > device and the drivers involved. What is happening in this case is > >> > that the VFs are managed outside of the kernel itself either by some > >> > user-space entity or by a firmware entity. > > > > But we can't simply split vfs and pfs as separate resources, the vf is > > dependent on the pf and that implies a privilege hierarchy, if not at > > least cooperation. The pf driver can intentionally or unintentionally > > disconnect the vfs at any point in time, possibly leading to data loss > > or at least denial of service for the vf consumer. I also don't trust > > that there aren't numerous sriov implementations where the pf isn't > > privileged to the point of being able to snoop data from the vf. So > > what sort of usage models are we enabling for the vf? A firmware owned > > vf subject to these conditions seems mostly pointless. > > In the case of these sort of parts the PF isn't really all that > privileged. Usually the PF is just a VF with the SR-IOV capability > hanging off of it. I suspect the only thing that might outright > control anything would be the bus master enable bit. Everything else > could be independent. The PF driver in such cases doesn't do much. It > is basically just the host for the configuration space. This is sounding more like an sriov capable stub driver. Certainly not all pfs behave the way you describe above, many are significantly more privileged and even if not, bus master is a pretty trivial control point. So we probably need a driver that claims devices known to behave this way, or a meta driver that bind via dynamic IDs or driver_override. The proposal here sort of covertly turns anything that's not an sriov driver into that stub driver with no guarantee that the subverted driver is a willing or safe participant. > >> So I was thinking about this some more. In the case of vfio-pci things > >> are a bit different since you are essentially taking a given device > >> and handing it to a VM or userspace and it doesn't guarantee a > >> communication between the two. > > > > Doesn't guarantee communication or cooperation or even trust. > > Right, but at the same time I consider this to be a shoot yourself in > the foot type scenario. If you are going to hand off your PF while VFs > are active then you are asking for whatever repercussions you end up > getting. I've added a warning and a TAINT_USER flag to my code at this > point if you load an active PF in vfio, and I have added a function > that locks the setting so it cannot be changed once we place a PF in > the control of vfio-pci. > > The way I see it there are two scenarios. One where the PF is just a > VF with an extra bit of SR-IOV configuration space floating off of it. > The other is where we want to have SR-IOV enabled and have some third > party managing the PF. The first one doesn't really care about the > communication and would prefer that whatever owns the driver on the PF > just ignore the SR-IOV portion of the configuration space. The second > one actually cares and would want some sort of > communication/cooperation but for now I don't see that as being the > easiest thing to do so it might be best to just let it see a fixed > number of VFs it just has to work with that. There's no such thing as a read-only sriov capability in the spec, which is a problem we ran into a while ago, vfio-pci exposes the capability, but protects it as read-only so the user cannot create devices on the host. QEMU passed this through to the guest, but that failed as soon as OVMF started supporting sriov as it's unable to size the VF BAR resources. Now QEMU drops the sriov capability from the guest capability chain. So it's not clear how this fixed number of vfs plan works. Are we inventing our own capability for this? If the pf driver is just a userspace driver and not a VM, maybe what we do now is sufficient, otherwise there's no standard for exposing fixed vfs. > >> My thought is to look at making SR-IOV configuration static or treat > >> it as read-only when the vfio-pci driver is loaded on a given > >> interface. In addition I would set the TAINT_USER flag and add a > >> warning about loading vfio-pci on an active PF, and provide the number > >> of VFs that were allocated. > >> > >> The idea with all of this being that we would at least have a partial > >> lock on all of this so that you can allocate some number of VFs, and > >> then start partitioning things up where you could assign the PF and > >> VFs as needed to the various interfaces. Once the PF is assigned to > >> the vfio-pci driver it would be locked in terms of the number of VFs > >> that are provided so there would be no need for an extra communication > >> channel between the PF driver and the host to deal with changes. > >> > >> Thoughts? > > > > Is an sriov configuration ever static? vfio is the most prolific user > > of the pci reset interfaces, AFAIK. Even if we add support for > > restoring the sriov configuration and even if the pf user isn't trying > > to be malicious, vf users would need to be prepared for their device > > arbitrarily dropping off the bus, mmio randomly (from their > > perspective) being disabled, and maybe other side-effects of the user > > just trying to use the pf device. Is there even any guarantee that a > > pf driver can operate identically, regardless of the state of sriov on > > the pf? It seems like the pf driver needs to be aware of which queues > > are available for itself vs the vfs in some designs. Is there still > > long term value in a solution for this if the kernel is tainted? > > As far as the device randomly resetting I don't really see how that is > any different from what we already have in terms of solutions > including stuff supported by in-driver SR-IOV. The Intel drivers are > notorious for resetting for any little thing like an MTU change. Us > resetting due to someone calling open/release on an interface wouldn't > be anything new. In addition AER can always hit us too. Yes we don't > have the same recovery mechanisms in place, but in the case of > something such as this we probably don't need that complex of a > recovery setup. [I think below paragraph is specifically answering last question above, still long term value in kernel tainting solution] > I would think there probably is. If not we wouldn't have gotten the > patches earlier that were still doing the tainting and warning and > making use of the vfio-pci driver to enable SR-IOV. I suspect the use > case for all this is to enable SR-IOV, setup the PF in vfio-pci, and > then assign VFs into and out of guests. I don't see the PF doing a lot > of moving after it is setup. The taint flag only really applies if > someone is looking for support and quite honestly I figure for now the > USER flag is appropriate for this kind of thing since we are deferring > all the networking control to the PF which is managed by userspace. If we're going to throw up our hands and taint the kernel, then maybe we could entertain the idea of doing that in a trivial vfio-pci sriov_configure function. But does that really meet all the use cases and what's the advantage of that vs Amazon (I guess they're the driver here) tainting the kernel with their own driver? Mellanox has tried to enable sriov in vfio-pci in the past, Cc Ilya. > > Minimally, it seems like the pf driver (the user driver, not vfio-pci) > > needs to be a cooperating party in this exchange. How do we determine > > that and what QoS guarantees are they providing by agreeing to it? > > I think the general idea here is that the user has already tested this > out and determined for themselves that the user driver does what they > want/need. If it didn't they wouldn't be going to all these lengths to > set this up. > > > Otherwise it doesn't seem like sriov provides the sort of static, hard > > partitioning of the device that you're asking for. A vf is a piece of > > the pf and we've released ownership and control of the pf to a user. > > Thanks, > > > > Alex > > I am pretty sure that you are describing is true of some, but not for > all. I think the Amazon solutions and the virtio solution are doing > hard partitioning of the part. I will leave it to those guys to speak > for themselves since I don't know anything about the hardware design > of those parts. I think we'd need device specific knowledge and enablement to be able to take advantage of any hardware partitioning, otherwise we need to assume the pf is privileged, as implemented in other sriov devices. I'm also trying to imagine whether there's a solution via the new vfio/mdev interface, where the mdev vendor driver would bind to the pf and effectively present itself as the mdev device. The vendor driver could provide sriov capabilities and bear the burden of ensuring that the pf is used cooperatively. The only existing mdev vendor drivers are vGPUs and rely on on-device DMA translation and isolation, such as through GTTs, but there have been some thoughts on providing IOMMU based isolation of mdev/sriov mixed devices (assuming DMA is even required for userspace management of the pf in this use case). [Cc Kirti] Thanks, Alex
On Thu, Mar 1, 2018 at 12:22 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 28 Feb 2018 16:36:38 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson >> <alex.williamson@redhat.com> wrote: >> > On Wed, 28 Feb 2018 09:49:21 -0800 >> > Alexander Duyck <alexander.duyck@gmail.com> wrote: >> > >> >> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >> >> <alexander.duyck@gmail.com> wrote: >> >> > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >> >> > <alex.williamson@redhat.com> wrote: >> >> >> On Tue, 27 Feb 2018 11:06:54 -0800 >> >> >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >> >> >> >> >>> From: Alexander Duyck <alexander.h.duyck@intel.com> >> >> >>> >> >> >>> This patch is meant to add support for SR-IOV on devices when the VFs are >> >> >>> not managed by the kernel. Examples of recent patches attempting to do this >> >> >>> include: >> >> >> >> >> >> It appears to enable sriov when the _pf_ is not managed by the >> >> >> kernel, but by "managed" we mean that either there is no pf driver or >> >> >> the pf driver doesn't provide an sriov_configure callback, >> >> >> intentionally or otherwise. >> >> >> >> >> >>> virto - https://patchwork.kernel.org/patch/10241225/ >> >> >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >> >> >>> vfio - https://patchwork.kernel.org/patch/10103353/ >> >> >>> uio - https://patchwork.kernel.org/patch/9974031/ >> >> >> >> >> >> So is the goal to get around the issues with enabling sriov on each of >> >> >> the above drivers by doing it under the covers or are you really just >> >> >> trying to enable sriov for a truly unmanage (no pf driver) case? For >> >> >> example, should a driver explicitly not wanting sriov enabled implement >> >> >> a dummy sriov_configure function? >> >> >> >> >> >>> Since this is quickly blowing up into a multi-driver problem it is probably >> >> >>> best to implement this solution in one spot. >> >> >>> >> >> >>> This patch is an attempt to do that. What we do with this patch is provide >> >> >>> a generic call to enable SR-IOV in the case that the PF driver is either >> >> >>> not present, or the PF driver doesn't support configuring SR-IOV. >> >> >>> >> >> >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >> >> >>> value is used as the drivers_autoprobe setting of the VFs when they are >> >> >>> being managed by an external entity such as userspace or device firmware >> >> >>> instead of being managed by the kernel. >> >> >> >> >> >> Documentation/ABI/testing/sysfs-bus-pci update is missing. >> >> > >> >> > I can make sure to update that in the next version. >> >> > >> >> >>> One side effect of this change is that the sriov_drivers_autoprobe and >> >> >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >> >> >>> disabled. Attempts to update them when SR-IOV is in use will only update >> >> >>> the local value and will not update sriov->autoprobe. >> >> >> >> >> >> And we expect users to understand when sriov_drivers_autoprobe applies >> >> >> vs sriov_unmanaged_autoprobe, even though they're using the same >> >> >> interfaces to enable sriov? Are all combinations expected to work, ex. >> >> >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >> >> >> only does it seems like there's opportunity to use this incorrectly, I >> >> >> think maybe it might be difficult to use correctly. >> >> >> >> >> >>> I based my patch set originally on the patch by Mark Rustad but there isn't >> >> >>> much left after going through and cleaning out the bits that were no longer >> >> >>> needed, and after incorporating the feedback from David Miller. >> >> >>> >> >> >>> I have included the authors of the original 4 patches above in the Cc here. >> >> >>> My hope is to get feedback and/or review on if this works for their use >> >> >>> cases. >> >> >>> >> >> >>> Cc: Mark Rustad <mark.d.rustad@intel.com> >> >> >>> Cc: Maximilian Heyne <mheyne@amazon.de> >> >> >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >> >> >>> Cc: David Woodhouse <dwmw@amazon.co.uk> >> >> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> >> >>> --- >> >> >>> drivers/pci/iov.c | 27 +++++++++++++++++++- >> >> >>> drivers/pci/pci-driver.c | 2 + >> >> >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >> >> >>> drivers/pci/pci.h | 4 ++- >> >> >>> include/linux/pci.h | 1 + >> >> >>> 5 files changed, 86 insertions(+), 10 deletions(-) >> >> >>> >> >> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >> >> >>> index 677924ae0350..7b8858bd8d03 100644 >> >> >>> --- a/drivers/pci/iov.c >> >> >>> +++ b/drivers/pci/iov.c >> >> >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >> >> >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >> >> >>> iov->pgsz = pgsz; >> >> >>> iov->self = dev; >> >> >>> + iov->autoprobe = true; >> >> >>> iov->drivers_autoprobe = true; >> >> >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >> >> >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >> >> >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >> >> >>> */ >> >> >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >> >> >>> { >> >> >>> - if (dev->is_physfn) >> >> >>> + if (dev->is_physfn) { >> >> >>> dev->sriov->drivers_autoprobe = auto_probe; >> >> >>> + if (!dev->sriov->num_VFs) >> >> >>> + dev->sriov->autoprobe = auto_probe; >> >> >> >> >> >> Why is dev->sriov->autoprobe set any time other than immediately prior >> >> >> to enabling VFs? >> >> > >> >> > My concern here was drivers that are still floating around with the >> >> > old module parameter option for enabling SR-IOV. In the unlikely event >> >> > that somebody was to use such a driver I wanted to make certain that >> >> > the drivers_autoprobe state was pre-populated. >> > >> > Good point, but maybe that just means we should be setting it in >> > sriov_enable()? >> >> I suppose we could. I would just have to check and see if we have any >> drivers lurking out there that are supporting SR-IOV without >> supporting the sysfs approach. As long as that is the case we could >> probably put it there. So in my follow-on I am now taking a slightly different route. I moved things over to us setting autoprobe in pci_enable_sriov, and now just bypass the calls to it and pci_disable_sriov in favor of setting autoprobe and then using sriov_enable or sriov_disable in the pci_sriov_configure_unmanged call. That should get us the same effect as what we were looking for. >> >> >>> + } >> >> >>> } >> >> >>> >> >> >>> /** >> >> >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >> >> >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >> >> >>> >> >> >>> /** >> >> >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >> >> >>> + * @dev: the PCI device >> >> >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >> >> >>> + * >> >> >>> + * Used to provide generic enable/disable SR-IOV option for devices >> >> >>> + * that do not manage the VFs generated by their driver, or have no >> >> >>> + * driver present. >> >> >>> + */ >> >> >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >> >> >>> +{ >> >> >>> + int rc = 0; >> >> >>> + >> >> >>> + if (!nr_virtfn) >> >> >>> + pci_disable_sriov(dev); >> >> >>> + else >> >> >>> + rc = pci_enable_sriov(dev, nr_virtfn); >> >> >>> + >> >> >>> + return rc ? rc : nr_virtfn; >> >> >>> +} >> >> >>> + >> >> >>> +/** >> >> >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver >> >> >>> * @dev: the PCI device >> >> >>> * >> >> >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >> >> >>> index 3bed6beda051..2cc68dff6130 100644 >> >> >>> --- a/drivers/pci/pci-driver.c >> >> >>> +++ b/drivers/pci/pci-driver.c >> >> >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >> >> >>> #ifdef CONFIG_PCI_IOV >> >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> >> >>> { >> >> >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >> >> >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >> >> >>> } >> >> >>> #else >> >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> >> >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >> >> >>> index eb6bee8724cc..e701b6dbc267 100644 >> >> >>> --- a/drivers/pci/pci-sysfs.c >> >> >>> +++ b/drivers/pci/pci-sysfs.c >> >> >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> >> >>> struct device_attribute *attr, >> >> >>> const char *buf, size_t count) >> >> >>> { >> >> >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >> >> >>> struct pci_dev *pdev = to_pci_dev(dev); >> >> >>> int ret; >> >> >>> u16 num_vfs; >> >> >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> >> >>> goto exit; >> >> >>> >> >> >>> /* is PF driver loaded w/callback */ >> >> >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >> >> >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >> >> >>> - ret = -ENOENT; >> >> >>> - goto exit; >> >> >>> - } >> >> >>> + if (pdev->driver && pdev->driver->sriov_configure) >> >> >>> + sriov_configure = pdev->driver->sriov_configure; >> >> >>> + else >> >> >>> + sriov_configure = pci_sriov_configure_unmanaged; >> >> >> >> >> >> So an unwitting user is now able to enable vfs, independent of the >> >> >> pf... the trouble being that they're probably going to expect them to >> >> >> work and the more common case is that they won't. For instance, what >> >> >> can you do with an igbvf when igb isn't managing the pf? >> >> > >> >> > Well the VFs wouldn't be able to do anything. Basically they would be >> >> > sitting there with no driver loaded on them unless they are assigned >> >> > to a guest, or the root user had enabled the unmanaged option. If you >> >> > did load a driver on it the VF would sit there with link down unless >> >> > either the PF driver is loaded or some user-space entity steps in to >> >> > start managing the PF. >> >> > >> >> > In reality this can already happen as last I recall igb and ixgbe were >> >> > already capable of having the PF driver removed when SR-IOV was >> >> > enabled and VFs were assigned. Basically the VFs just report link down >> >> > and don't do anything. Reloading the PF driver would have it take over >> >> > in the case of igb and ixgbe since they were designed to handle that >> >> > type of scenario. >> > >> > I think only ixgbe behaves this way between the two, igb disables sriov >> > in its remove function unless we're hung up by that silly >> > pci_vfs_assigned() check, which doesn't apply to vfio assignment. >> > Regardless, yes, some pf drivers do leave sriov enabled on remove, >> > whether that's useful or reasonable is another question. >> >> Right. We can argue that another day. I was just sighting the igb behavior. >> >> >> >> Or what happens when vfio-pci owns the pf, sriov is enabled via the >> >> >> unmanaged interface, and the pf user driver segfaults and gets killed, >> >> >> causing vfio-pci to restore the pf state, including wiping the sriov >> >> >> config? >> >> > >> >> > Wiping the config shouldn't have any effect on the allocated VF pci >> >> > devices. It will cause the VFs to act as though they have fallen off >> >> > of the bus though and the guests would see a surprise remove type >> >> > behavior if I am not mistaken. The MMIO and config accesses would fail >> >> > until the SR-IOV configuration is restored. Really this shouldn't be a >> >> > problem as long as the SR-IOV is enabled prior to loading the vfio-pci >> >> > driver as I would assume it would restore the state an re-enable >> >> > SR-IOV. >> > >> > Nope, sriov does not appear to be part of the device state saved and >> > restored around reset, so I think we just end up in an inconsistent >> > state. Also, vfio-pci stores a copy of the saved state of a device >> > prior to giving the user access to the device and tries to restore that >> > copy when released by the user, so anything that's part of the save >> > state and modified via side-channels while the device is opened, would >> > still be lost. >> >> Well it is but it isn't. The pci_restore_state will call >> pci_restore_iov_state which will just repopulate the data straight out >> of the pdev->sriov structure. I think the assumption is we were >> already carrying enough information that there wasn't any point in >> saving the state since we could just restore it from that structure >> anyway. > > Ah, right, I remember that now. > >> I've been chasing it down and can verify that much this morning after >> testing. A regular hit of the sysfs reset control will not erase the >> state. > > Perhaps long term, but momentarily the vf disappears and that could > imply data loss, no? It could imply data loss, but really it is no worse then accidentally unplugging an Ethernet cable, or at least that has been my experience with the Intel Ethernet drivers. >> One issue that I think I found that may be a bug though is if I >> load the vfio-pci driver on an interface and unbind it I am seeing the >> port state reset. I have it chased down to the idle D3 code. It looks >> like going from vfio_pci_probe to vfio_pci_remove is triggering the >> equivilent of the pci_pm_reset since it is cycling me through >> D0->D3->D0 without restoring state after the fact. I also verified >> that setting disable_idle_d3 resolves the issue. Would you have any >> complaints about me doing a save_state in the probe call, and a >> restore_state in the remove? It seems like that should probably be the >> easiest fix. > > Hmm, perhaps I had assumed that pci_set_power_state() would handle such > things, but indeed I do see other drivers calling pci_save_state() > prior and pci_restore_state() after. I think we'd need to audit all of > vfio-pci's calls to pci_set_power_state(), not just probe and remove. Is it okay if I leave that to you then? I'm thinking that might be getting into things I am not really prepared to tackle. >> >> > In the grand scheme of things how would the situation you describe be >> >> > any different than someone using the "reset" sysfs control on the PF >> >> > while using SR-IOV with driver supported SR-IOV? >> > >> > Similar, but the pf driver has enabled sriov and can register the >> > reset_done error handler to be notified to re-enable sriov or perform >> > other cleanup. If the pf driver is not participating in sriov, it >> > would seem exceptional to support this. Triggering reset via sysfs >> > also falls more closely to shooting yourself in the foot than I think >> > we want to design into driver/subsystem APIs. >> >> To some extent I agree with the shooting yourself in the foot. At the >> same time there isn't actually all that much to re-enable SR-IOV. >> Restoring the PCI SR-IOV configuration space and re-enabling the bus >> master enable. > > Bus master is a pretty interesting example of the trivial control a pf > driver can exert on the vfs though. I'm actually not sure if it matters or not. I know the VFs have their own bus-master enable bits and I may be thinking of more of an implementation specific thing rather than something that is really spec compliant. >> The Amazon guys would probably know better than I since I haven't >> really worked much with one of these parts yet. Actually the virtio >> that Mark pushed may behave the same way too. As far as I know in >> these firmware cases the hardware itself has everything >> pre-partitioned and set to re-enable as soon as the SR-IOV bits are >> set. I think all they need is a few bits toggled and they are pretty >> much good to go. > > Are they just looking for an sriov capable stub driver? With > increasing vf counts, being able to use something like vfio-pci on the > pf seems like all risk with statistically insignificant increase in > density. On the other hand, if there's a userspace pf management > driver, why not just make it trusted by adding it as a native host > kernel driver? If we're talking about tainting the host kernel to > enable this interaction, maybe it should just be tainted by an out of > tree, possibly non-gpl host pf driver anyway. There can't really be a > case where the pf would be used by an average user without some degree > of privilege or cooperation, right? It wouldn't be much of a stretch to think that something like that might be possible. I seem to remember when we first started SR-IOV somebody had a driver that was doing essentially that. Basically what you end up with is function 0 carrying the SR-IOV configuration, and any VFs either pop in or out of existence based on that, but the MMIO for the device may not have any visibility into anything beyond the resources it was pre-allocated. >> >> > I suppose if you really wanted we could add a new call that you could >> >> > put into the sriov_configure pointer that would just make it always >> >> > return error. Then that way the configuration could be locked until >> >> > the driver is unloaded. >> >> > >> >> >> I guess I don't understand how vfs can operate fully independent of the >> >> >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to >> >> >> avoid contending with such issues. >> >> > >> >> > This approach isn't without risk, but it isn't as if it is really a >> >> > new risk we are introducing, and we could do further work to help >> >> > smooth out issues like this. Much of this depends on the design of the >> >> > device and the drivers involved. What is happening in this case is >> >> > that the VFs are managed outside of the kernel itself either by some >> >> > user-space entity or by a firmware entity. >> > >> > But we can't simply split vfs and pfs as separate resources, the vf is >> > dependent on the pf and that implies a privilege hierarchy, if not at >> > least cooperation. The pf driver can intentionally or unintentionally >> > disconnect the vfs at any point in time, possibly leading to data loss >> > or at least denial of service for the vf consumer. I also don't trust >> > that there aren't numerous sriov implementations where the pf isn't >> > privileged to the point of being able to snoop data from the vf. So >> > what sort of usage models are we enabling for the vf? A firmware owned >> > vf subject to these conditions seems mostly pointless. >> >> In the case of these sort of parts the PF isn't really all that >> privileged. Usually the PF is just a VF with the SR-IOV capability >> hanging off of it. I suspect the only thing that might outright >> control anything would be the bus master enable bit. Everything else >> could be independent. The PF driver in such cases doesn't do much. It >> is basically just the host for the configuration space. > > This is sounding more like an sriov capable stub driver. Certainly not > all pfs behave the way you describe above, many are significantly more > privileged and even if not, bus master is a pretty trivial control > point. So we probably need a driver that claims devices known to > behave this way, or a meta driver that bind via dynamic IDs or > driver_override. The proposal here sort of covertly turns anything > that's not an sriov driver into that stub driver with no guarantee that > the subverted driver is a willing or safe participant. Yeah, I'm starting to think that way as well. The problem of leaving VFs floating around as enabled is starting to be problematic and would require much more of a refactor then I think we want here. >> >> So I was thinking about this some more. In the case of vfio-pci things >> >> are a bit different since you are essentially taking a given device >> >> and handing it to a VM or userspace and it doesn't guarantee a >> >> communication between the two. >> > >> > Doesn't guarantee communication or cooperation or even trust. >> >> Right, but at the same time I consider this to be a shoot yourself in >> the foot type scenario. If you are going to hand off your PF while VFs >> are active then you are asking for whatever repercussions you end up >> getting. I've added a warning and a TAINT_USER flag to my code at this >> point if you load an active PF in vfio, and I have added a function >> that locks the setting so it cannot be changed once we place a PF in >> the control of vfio-pci. >> >> The way I see it there are two scenarios. One where the PF is just a >> VF with an extra bit of SR-IOV configuration space floating off of it. >> The other is where we want to have SR-IOV enabled and have some third >> party managing the PF. The first one doesn't really care about the >> communication and would prefer that whatever owns the driver on the PF >> just ignore the SR-IOV portion of the configuration space. The second >> one actually cares and would want some sort of >> communication/cooperation but for now I don't see that as being the >> easiest thing to do so it might be best to just let it see a fixed >> number of VFs it just has to work with that. > > There's no such thing as a read-only sriov capability in the spec, > which is a problem we ran into a while ago, vfio-pci exposes the > capability, but protects it as read-only so the user cannot create > devices on the host. QEMU passed this through to the guest, but that > failed as soon as OVMF started supporting sriov as it's unable to size > the VF BAR resources. Now QEMU drops the sriov capability from the > guest capability chain. So it's not clear how this fixed number of vfs > plan works. Are we inventing our own capability for this? If the pf > driver is just a userspace driver and not a VM, maybe what we do now is > sufficient, otherwise there's no standard for exposing fixed vfs. So one thought I had is what if we provide an ioctl command that allows for setting the number of VFs through VFIO? Basically it wouldn't be exposed through the sysfs, but would be totally controlled by the userspace using the sriov_configure_unmanged call to enable/disable the VFs and ultimately cleaning it up when the device it is released. I would imagine all the Amazon guys were looking for is something where they could just have some sort of command that could reach through and manage their VF count from some daemon. If they assigned the interface to a guest it would not be able to touch the SR-IOV capability unless QEMU went and implemented it which likely isn't happening anytime soon. It would provide the userspace guys a way to allocate the VF resources, call this ioctl, and then start managing VFs in the case of a userspace managed entity instead of it just being a firmware or preallocated type SR-IOV configuration. >> >> My thought is to look at making SR-IOV configuration static or treat >> >> it as read-only when the vfio-pci driver is loaded on a given >> >> interface. In addition I would set the TAINT_USER flag and add a >> >> warning about loading vfio-pci on an active PF, and provide the number >> >> of VFs that were allocated. >> >> >> >> The idea with all of this being that we would at least have a partial >> >> lock on all of this so that you can allocate some number of VFs, and >> >> then start partitioning things up where you could assign the PF and >> >> VFs as needed to the various interfaces. Once the PF is assigned to >> >> the vfio-pci driver it would be locked in terms of the number of VFs >> >> that are provided so there would be no need for an extra communication >> >> channel between the PF driver and the host to deal with changes. >> >> >> >> Thoughts? >> > >> > Is an sriov configuration ever static? vfio is the most prolific user >> > of the pci reset interfaces, AFAIK. Even if we add support for >> > restoring the sriov configuration and even if the pf user isn't trying >> > to be malicious, vf users would need to be prepared for their device >> > arbitrarily dropping off the bus, mmio randomly (from their >> > perspective) being disabled, and maybe other side-effects of the user >> > just trying to use the pf device. Is there even any guarantee that a >> > pf driver can operate identically, regardless of the state of sriov on >> > the pf? It seems like the pf driver needs to be aware of which queues >> > are available for itself vs the vfs in some designs. Is there still >> > long term value in a solution for this if the kernel is tainted? >> >> As far as the device randomly resetting I don't really see how that is >> any different from what we already have in terms of solutions >> including stuff supported by in-driver SR-IOV. The Intel drivers are >> notorious for resetting for any little thing like an MTU change. Us >> resetting due to someone calling open/release on an interface wouldn't >> be anything new. In addition AER can always hit us too. Yes we don't >> have the same recovery mechanisms in place, but in the case of >> something such as this we probably don't need that complex of a >> recovery setup. > > [I think below paragraph is specifically answering last question above, > still long term value in kernel tainting solution] > >> I would think there probably is. If not we wouldn't have gotten the >> patches earlier that were still doing the tainting and warning and >> making use of the vfio-pci driver to enable SR-IOV. I suspect the use >> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and >> then assign VFs into and out of guests. I don't see the PF doing a lot >> of moving after it is setup. The taint flag only really applies if >> someone is looking for support and quite honestly I figure for now the >> USER flag is appropriate for this kind of thing since we are deferring >> all the networking control to the PF which is managed by userspace. > > If we're going to throw up our hands and taint the kernel, then maybe > we could entertain the idea of doing that in a trivial vfio-pci > sriov_configure function. But does that really meet all the use cases > and what's the advantage of that vs Amazon (I guess they're the driver > here) tainting the kernel with their own driver? Mellanox has tried to > enable sriov in vfio-pci in the past, Cc Ilya. > >> > Minimally, it seems like the pf driver (the user driver, not vfio-pci) >> > needs to be a cooperating party in this exchange. How do we determine >> > that and what QoS guarantees are they providing by agreeing to it? >> >> I think the general idea here is that the user has already tested this >> out and determined for themselves that the user driver does what they >> want/need. If it didn't they wouldn't be going to all these lengths to >> set this up. >> >> > Otherwise it doesn't seem like sriov provides the sort of static, hard >> > partitioning of the device that you're asking for. A vf is a piece of >> > the pf and we've released ownership and control of the pf to a user. >> > Thanks, >> > >> > Alex >> >> I am pretty sure that you are describing is true of some, but not for >> all. I think the Amazon solutions and the virtio solution are doing >> hard partitioning of the part. I will leave it to those guys to speak >> for themselves since I don't know anything about the hardware design >> of those parts. > > I think we'd need device specific knowledge and enablement to be able > to take advantage of any hardware partitioning, otherwise we need to > assume the pf is privileged, as implemented in other sriov devices. What we would probably need to end up doing is coming up with a whitelist of parts that support SR-IOV via a unprivileged PF. Maybe some sort of PCI quirk that would add a new feature flag that says we could do this without tainting the kernel. As-is this kind of thing would still be very experimental/new for the kernel so to some extent using the taint flag is also reflecting that since I don't know how common any use case like this is at this point or how well it has been tested. > I'm also trying to imagine whether there's a solution via the new > vfio/mdev interface, where the mdev vendor driver would bind to the pf > and effectively present itself as the mdev device. The vendor driver > could provide sriov capabilities and bear the burden of ensuring that > the pf is used cooperatively. The only existing mdev vendor drivers are > vGPUs and rely on on-device DMA translation and isolation, such as > through GTTs, but there have been some thoughts on providing IOMMU based > isolation of mdev/sriov mixed devices (assuming DMA is even required > for userspace management of the pf in this use case). [Cc Kirti] > Thanks, > > Alex
On Thu, 1 Mar 2018 14:42:40 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Thu, Mar 1, 2018 at 12:22 PM, Alex Williamson > <alex.williamson@redhat.com> wrote: > > On Wed, 28 Feb 2018 16:36:38 -0800 > > Alexander Duyck <alexander.duyck@gmail.com> wrote: > > > >> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson > >> <alex.williamson@redhat.com> wrote: > >> > On Wed, 28 Feb 2018 09:49:21 -0800 > >> > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> > > >> >> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck > >> >> <alexander.duyck@gmail.com> wrote: > >> >> > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson > >> >> > <alex.williamson@redhat.com> wrote: > >> >> >> On Tue, 27 Feb 2018 11:06:54 -0800 > >> >> >> Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> >> >> > >> >> >>> From: Alexander Duyck <alexander.h.duyck@intel.com> > >> >> >>> > >> >> >>> This patch is meant to add support for SR-IOV on devices when the VFs are > >> >> >>> not managed by the kernel. Examples of recent patches attempting to do this > >> >> >>> include: > >> >> >> > >> >> >> It appears to enable sriov when the _pf_ is not managed by the > >> >> >> kernel, but by "managed" we mean that either there is no pf driver or > >> >> >> the pf driver doesn't provide an sriov_configure callback, > >> >> >> intentionally or otherwise. > >> >> >> > >> >> >>> virto - https://patchwork.kernel.org/patch/10241225/ > >> >> >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ > >> >> >>> vfio - https://patchwork.kernel.org/patch/10103353/ > >> >> >>> uio - https://patchwork.kernel.org/patch/9974031/ > >> >> >> > >> >> >> So is the goal to get around the issues with enabling sriov on each of > >> >> >> the above drivers by doing it under the covers or are you really just > >> >> >> trying to enable sriov for a truly unmanage (no pf driver) case? For > >> >> >> example, should a driver explicitly not wanting sriov enabled implement > >> >> >> a dummy sriov_configure function? > >> >> >> > >> >> >>> Since this is quickly blowing up into a multi-driver problem it is probably > >> >> >>> best to implement this solution in one spot. > >> >> >>> > >> >> >>> This patch is an attempt to do that. What we do with this patch is provide > >> >> >>> a generic call to enable SR-IOV in the case that the PF driver is either > >> >> >>> not present, or the PF driver doesn't support configuring SR-IOV. > >> >> >>> > >> >> >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This > >> >> >>> value is used as the drivers_autoprobe setting of the VFs when they are > >> >> >>> being managed by an external entity such as userspace or device firmware > >> >> >>> instead of being managed by the kernel. > >> >> >> > >> >> >> Documentation/ABI/testing/sysfs-bus-pci update is missing. > >> >> > > >> >> > I can make sure to update that in the next version. > >> >> > > >> >> >>> One side effect of this change is that the sriov_drivers_autoprobe and > >> >> >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is > >> >> >>> disabled. Attempts to update them when SR-IOV is in use will only update > >> >> >>> the local value and will not update sriov->autoprobe. > >> >> >> > >> >> >> And we expect users to understand when sriov_drivers_autoprobe applies > >> >> >> vs sriov_unmanaged_autoprobe, even though they're using the same > >> >> >> interfaces to enable sriov? Are all combinations expected to work, ex. > >> >> >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not > >> >> >> only does it seems like there's opportunity to use this incorrectly, I > >> >> >> think maybe it might be difficult to use correctly. > >> >> >> > >> >> >>> I based my patch set originally on the patch by Mark Rustad but there isn't > >> >> >>> much left after going through and cleaning out the bits that were no longer > >> >> >>> needed, and after incorporating the feedback from David Miller. > >> >> >>> > >> >> >>> I have included the authors of the original 4 patches above in the Cc here. > >> >> >>> My hope is to get feedback and/or review on if this works for their use > >> >> >>> cases. > >> >> >>> > >> >> >>> Cc: Mark Rustad <mark.d.rustad@intel.com> > >> >> >>> Cc: Maximilian Heyne <mheyne@amazon.de> > >> >> >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> > >> >> >>> Cc: David Woodhouse <dwmw@amazon.co.uk> > >> >> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > >> >> >>> --- > >> >> >>> drivers/pci/iov.c | 27 +++++++++++++++++++- > >> >> >>> drivers/pci/pci-driver.c | 2 + > >> >> >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- > >> >> >>> drivers/pci/pci.h | 4 ++- > >> >> >>> include/linux/pci.h | 1 + > >> >> >>> 5 files changed, 86 insertions(+), 10 deletions(-) > >> >> >>> > >> >> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > >> >> >>> index 677924ae0350..7b8858bd8d03 100644 > >> >> >>> --- a/drivers/pci/iov.c > >> >> >>> +++ b/drivers/pci/iov.c > >> >> >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) > >> >> >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); > >> >> >>> iov->pgsz = pgsz; > >> >> >>> iov->self = dev; > >> >> >>> + iov->autoprobe = true; > >> >> >>> iov->drivers_autoprobe = true; > >> >> >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); > >> >> >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); > >> >> >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) > >> >> >>> */ > >> >> >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) > >> >> >>> { > >> >> >>> - if (dev->is_physfn) > >> >> >>> + if (dev->is_physfn) { > >> >> >>> dev->sriov->drivers_autoprobe = auto_probe; > >> >> >>> + if (!dev->sriov->num_VFs) > >> >> >>> + dev->sriov->autoprobe = auto_probe; > >> >> >> > >> >> >> Why is dev->sriov->autoprobe set any time other than immediately prior > >> >> >> to enabling VFs? > >> >> > > >> >> > My concern here was drivers that are still floating around with the > >> >> > old module parameter option for enabling SR-IOV. In the unlikely event > >> >> > that somebody was to use such a driver I wanted to make certain that > >> >> > the drivers_autoprobe state was pre-populated. > >> > > >> > Good point, but maybe that just means we should be setting it in > >> > sriov_enable()? > >> > >> I suppose we could. I would just have to check and see if we have any > >> drivers lurking out there that are supporting SR-IOV without > >> supporting the sysfs approach. As long as that is the case we could > >> probably put it there. > > So in my follow-on I am now taking a slightly different route. I moved > things over to us setting autoprobe in pci_enable_sriov, and now just > bypass the calls to it and pci_disable_sriov in favor of setting > autoprobe and then using sriov_enable or sriov_disable in the > pci_sriov_configure_unmanged call. > > That should get us the same effect as what we were looking for. > > >> >> >>> + } > >> >> >>> } > >> >> >>> > >> >> >>> /** > >> >> >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) > >> >> >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); > >> >> >>> > >> >> >>> /** > >> >> >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV > >> >> >>> + * @dev: the PCI device > >> >> >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable > >> >> >>> + * > >> >> >>> + * Used to provide generic enable/disable SR-IOV option for devices > >> >> >>> + * that do not manage the VFs generated by their driver, or have no > >> >> >>> + * driver present. > >> >> >>> + */ > >> >> >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) > >> >> >>> +{ > >> >> >>> + int rc = 0; > >> >> >>> + > >> >> >>> + if (!nr_virtfn) > >> >> >>> + pci_disable_sriov(dev); > >> >> >>> + else > >> >> >>> + rc = pci_enable_sriov(dev, nr_virtfn); > >> >> >>> + > >> >> >>> + return rc ? rc : nr_virtfn; > >> >> >>> +} > >> >> >>> + > >> >> >>> +/** > >> >> >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver > >> >> >>> * @dev: the PCI device > >> >> >>> * > >> >> >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > >> >> >>> index 3bed6beda051..2cc68dff6130 100644 > >> >> >>> --- a/drivers/pci/pci-driver.c > >> >> >>> +++ b/drivers/pci/pci-driver.c > >> >> >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) > >> >> >>> #ifdef CONFIG_PCI_IOV > >> >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) > >> >> >>> { > >> >> >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); > >> >> >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); > >> >> >>> } > >> >> >>> #else > >> >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) > >> >> >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > >> >> >>> index eb6bee8724cc..e701b6dbc267 100644 > >> >> >>> --- a/drivers/pci/pci-sysfs.c > >> >> >>> +++ b/drivers/pci/pci-sysfs.c > >> >> >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, > >> >> >>> struct device_attribute *attr, > >> >> >>> const char *buf, size_t count) > >> >> >>> { > >> >> >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); > >> >> >>> struct pci_dev *pdev = to_pci_dev(dev); > >> >> >>> int ret; > >> >> >>> u16 num_vfs; > >> >> >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, > >> >> >>> goto exit; > >> >> >>> > >> >> >>> /* is PF driver loaded w/callback */ > >> >> >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { > >> >> >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); > >> >> >>> - ret = -ENOENT; > >> >> >>> - goto exit; > >> >> >>> - } > >> >> >>> + if (pdev->driver && pdev->driver->sriov_configure) > >> >> >>> + sriov_configure = pdev->driver->sriov_configure; > >> >> >>> + else > >> >> >>> + sriov_configure = pci_sriov_configure_unmanaged; > >> >> >> > >> >> >> So an unwitting user is now able to enable vfs, independent of the > >> >> >> pf... the trouble being that they're probably going to expect them to > >> >> >> work and the more common case is that they won't. For instance, what > >> >> >> can you do with an igbvf when igb isn't managing the pf? > >> >> > > >> >> > Well the VFs wouldn't be able to do anything. Basically they would be > >> >> > sitting there with no driver loaded on them unless they are assigned > >> >> > to a guest, or the root user had enabled the unmanaged option. If you > >> >> > did load a driver on it the VF would sit there with link down unless > >> >> > either the PF driver is loaded or some user-space entity steps in to > >> >> > start managing the PF. > >> >> > > >> >> > In reality this can already happen as last I recall igb and ixgbe were > >> >> > already capable of having the PF driver removed when SR-IOV was > >> >> > enabled and VFs were assigned. Basically the VFs just report link down > >> >> > and don't do anything. Reloading the PF driver would have it take over > >> >> > in the case of igb and ixgbe since they were designed to handle that > >> >> > type of scenario. > >> > > >> > I think only ixgbe behaves this way between the two, igb disables sriov > >> > in its remove function unless we're hung up by that silly > >> > pci_vfs_assigned() check, which doesn't apply to vfio assignment. > >> > Regardless, yes, some pf drivers do leave sriov enabled on remove, > >> > whether that's useful or reasonable is another question. > >> > >> Right. We can argue that another day. I was just sighting the igb behavior. > >> > >> >> >> Or what happens when vfio-pci owns the pf, sriov is enabled via the > >> >> >> unmanaged interface, and the pf user driver segfaults and gets killed, > >> >> >> causing vfio-pci to restore the pf state, including wiping the sriov > >> >> >> config? > >> >> > > >> >> > Wiping the config shouldn't have any effect on the allocated VF pci > >> >> > devices. It will cause the VFs to act as though they have fallen off > >> >> > of the bus though and the guests would see a surprise remove type > >> >> > behavior if I am not mistaken. The MMIO and config accesses would fail > >> >> > until the SR-IOV configuration is restored. Really this shouldn't be a > >> >> > problem as long as the SR-IOV is enabled prior to loading the vfio-pci > >> >> > driver as I would assume it would restore the state an re-enable > >> >> > SR-IOV. > >> > > >> > Nope, sriov does not appear to be part of the device state saved and > >> > restored around reset, so I think we just end up in an inconsistent > >> > state. Also, vfio-pci stores a copy of the saved state of a device > >> > prior to giving the user access to the device and tries to restore that > >> > copy when released by the user, so anything that's part of the save > >> > state and modified via side-channels while the device is opened, would > >> > still be lost. > >> > >> Well it is but it isn't. The pci_restore_state will call > >> pci_restore_iov_state which will just repopulate the data straight out > >> of the pdev->sriov structure. I think the assumption is we were > >> already carrying enough information that there wasn't any point in > >> saving the state since we could just restore it from that structure > >> anyway. > > > > Ah, right, I remember that now. > > > >> I've been chasing it down and can verify that much this morning after > >> testing. A regular hit of the sysfs reset control will not erase the > >> state. > > > > Perhaps long term, but momentarily the vf disappears and that could > > imply data loss, no? > > It could imply data loss, but really it is no worse then accidentally > unplugging an Ethernet cable, or at least that has been my experience > with the Intel Ethernet drivers. Are NIC vfs a good choice for least common denominator behavior here? Links dropping are normal for NICs, they have protocols for that. I'm not sure something like an NMVe or SATA vf would necessarily be as resilient. Of course, even if we limit the discussion to NIC vfs, how much work can the vf do if the pf is constantly triggering the link down? Maybe it's not data loss, but the opportunity for DoS is still high. > >> One issue that I think I found that may be a bug though is if I > >> load the vfio-pci driver on an interface and unbind it I am seeing the > >> port state reset. I have it chased down to the idle D3 code. It looks > >> like going from vfio_pci_probe to vfio_pci_remove is triggering the > >> equivilent of the pci_pm_reset since it is cycling me through > >> D0->D3->D0 without restoring state after the fact. I also verified > >> that setting disable_idle_d3 resolves the issue. Would you have any > >> complaints about me doing a save_state in the probe call, and a > >> restore_state in the remove? It seems like that should probably be the > >> easiest fix. > > > > Hmm, perhaps I had assumed that pci_set_power_state() would handle such > > things, but indeed I do see other drivers calling pci_save_state() > > prior and pci_restore_state() after. I think we'd need to audit all of > > vfio-pci's calls to pci_set_power_state(), not just probe and remove. > > Is it okay if I leave that to you then? I'm thinking that might be > getting into things I am not really prepared to tackle. Sure, thanks for spotting it. > >> >> > In the grand scheme of things how would the situation you describe be > >> >> > any different than someone using the "reset" sysfs control on the PF > >> >> > while using SR-IOV with driver supported SR-IOV? > >> > > >> > Similar, but the pf driver has enabled sriov and can register the > >> > reset_done error handler to be notified to re-enable sriov or perform > >> > other cleanup. If the pf driver is not participating in sriov, it > >> > would seem exceptional to support this. Triggering reset via sysfs > >> > also falls more closely to shooting yourself in the foot than I think > >> > we want to design into driver/subsystem APIs. > >> > >> To some extent I agree with the shooting yourself in the foot. At the > >> same time there isn't actually all that much to re-enable SR-IOV. > >> Restoring the PCI SR-IOV configuration space and re-enabling the bus > >> master enable. > > > > Bus master is a pretty interesting example of the trivial control a pf > > driver can exert on the vfs though. > > I'm actually not sure if it matters or not. I know the VFs have their > own bus-master enable bits and I may be thinking of more of an > implementation specific thing rather than something that is really > spec compliant. > > >> The Amazon guys would probably know better than I since I haven't > >> really worked much with one of these parts yet. Actually the virtio > >> that Mark pushed may behave the same way too. As far as I know in > >> these firmware cases the hardware itself has everything > >> pre-partitioned and set to re-enable as soon as the SR-IOV bits are > >> set. I think all they need is a few bits toggled and they are pretty > >> much good to go. > > > > Are they just looking for an sriov capable stub driver? With > > increasing vf counts, being able to use something like vfio-pci on the > > pf seems like all risk with statistically insignificant increase in > > density. On the other hand, if there's a userspace pf management > > driver, why not just make it trusted by adding it as a native host > > kernel driver? If we're talking about tainting the host kernel to > > enable this interaction, maybe it should just be tainted by an out of > > tree, possibly non-gpl host pf driver anyway. There can't really be a > > case where the pf would be used by an average user without some degree > > of privilege or cooperation, right? > > It wouldn't be much of a stretch to think that something like that > might be possible. I seem to remember when we first started SR-IOV > somebody had a driver that was doing essentially that. Basically what > you end up with is function 0 carrying the SR-IOV configuration, and > any VFs either pop in or out of existence based on that, but the MMIO > for the device may not have any visibility into anything beyond the > resources it was pre-allocated. > > >> >> > I suppose if you really wanted we could add a new call that you could > >> >> > put into the sriov_configure pointer that would just make it always > >> >> > return error. Then that way the configuration could be locked until > >> >> > the driver is unloaded. > >> >> > > >> >> >> I guess I don't understand how vfs can operate fully independent of the > >> >> >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to > >> >> >> avoid contending with such issues. > >> >> > > >> >> > This approach isn't without risk, but it isn't as if it is really a > >> >> > new risk we are introducing, and we could do further work to help > >> >> > smooth out issues like this. Much of this depends on the design of the > >> >> > device and the drivers involved. What is happening in this case is > >> >> > that the VFs are managed outside of the kernel itself either by some > >> >> > user-space entity or by a firmware entity. > >> > > >> > But we can't simply split vfs and pfs as separate resources, the vf is > >> > dependent on the pf and that implies a privilege hierarchy, if not at > >> > least cooperation. The pf driver can intentionally or unintentionally > >> > disconnect the vfs at any point in time, possibly leading to data loss > >> > or at least denial of service for the vf consumer. I also don't trust > >> > that there aren't numerous sriov implementations where the pf isn't > >> > privileged to the point of being able to snoop data from the vf. So > >> > what sort of usage models are we enabling for the vf? A firmware owned > >> > vf subject to these conditions seems mostly pointless. > >> > >> In the case of these sort of parts the PF isn't really all that > >> privileged. Usually the PF is just a VF with the SR-IOV capability > >> hanging off of it. I suspect the only thing that might outright > >> control anything would be the bus master enable bit. Everything else > >> could be independent. The PF driver in such cases doesn't do much. It > >> is basically just the host for the configuration space. > > > > This is sounding more like an sriov capable stub driver. Certainly not > > all pfs behave the way you describe above, many are significantly more > > privileged and even if not, bus master is a pretty trivial control > > point. So we probably need a driver that claims devices known to > > behave this way, or a meta driver that bind via dynamic IDs or > > driver_override. The proposal here sort of covertly turns anything > > that's not an sriov driver into that stub driver with no guarantee that > > the subverted driver is a willing or safe participant. > > Yeah, I'm starting to think that way as well. The problem of leaving > VFs floating around as enabled is starting to be problematic and would > require much more of a refactor then I think we want here. > > >> >> So I was thinking about this some more. In the case of vfio-pci things > >> >> are a bit different since you are essentially taking a given device > >> >> and handing it to a VM or userspace and it doesn't guarantee a > >> >> communication between the two. > >> > > >> > Doesn't guarantee communication or cooperation or even trust. > >> > >> Right, but at the same time I consider this to be a shoot yourself in > >> the foot type scenario. If you are going to hand off your PF while VFs > >> are active then you are asking for whatever repercussions you end up > >> getting. I've added a warning and a TAINT_USER flag to my code at this > >> point if you load an active PF in vfio, and I have added a function > >> that locks the setting so it cannot be changed once we place a PF in > >> the control of vfio-pci. > >> > >> The way I see it there are two scenarios. One where the PF is just a > >> VF with an extra bit of SR-IOV configuration space floating off of it. > >> The other is where we want to have SR-IOV enabled and have some third > >> party managing the PF. The first one doesn't really care about the > >> communication and would prefer that whatever owns the driver on the PF > >> just ignore the SR-IOV portion of the configuration space. The second > >> one actually cares and would want some sort of > >> communication/cooperation but for now I don't see that as being the > >> easiest thing to do so it might be best to just let it see a fixed > >> number of VFs it just has to work with that. > > > > There's no such thing as a read-only sriov capability in the spec, > > which is a problem we ran into a while ago, vfio-pci exposes the > > capability, but protects it as read-only so the user cannot create > > devices on the host. QEMU passed this through to the guest, but that > > failed as soon as OVMF started supporting sriov as it's unable to size > > the VF BAR resources. Now QEMU drops the sriov capability from the > > guest capability chain. So it's not clear how this fixed number of vfs > > plan works. Are we inventing our own capability for this? If the pf > > driver is just a userspace driver and not a VM, maybe what we do now is > > sufficient, otherwise there's no standard for exposing fixed vfs. > > So one thought I had is what if we provide an ioctl command that > allows for setting the number of VFs through VFIO? Basically it > wouldn't be exposed through the sysfs, but would be totally controlled > by the userspace using the sriov_configure_unmanged call to > enable/disable the VFs and ultimately cleaning it up when the device > it is released. I would imagine all the Amazon guys were looking for > is something where they could just have some sort of command that > could reach through and manage their VF count from some daemon. If > they assigned the interface to a guest it would not be able to touch > the SR-IOV capability unless QEMU went and implemented it which likely > isn't happening anytime soon. It would provide the userspace guys a > way to allocate the VF resources, call this ioctl, and then start > managing VFs in the case of a userspace managed entity instead of it > just being a firmware or preallocated type SR-IOV configuration. Wait a sec, we were considering that a user owned pf sourcing vfs would taint the host kernel and now we want to give the user the ability to enable those vfs and trigger that tainting? I saw the route through sysfs to enable sriov on a user owned device as a feature because it would require an existing privilege path rather than creating one through vfio. For example, an unprivileged QEMU could submit a request to a privileged entity to enable sriov through sysfs. If we wanted to allow the user to enable sriov directly, the user would at least need to run with privileges like CAP_SYS_ADMIN. Side note: we could just enable emulation of the sriov capability through config space in vfio-pci if we wanted to allow that, rather than adding a new ioctl. > >> >> My thought is to look at making SR-IOV configuration static or treat > >> >> it as read-only when the vfio-pci driver is loaded on a given > >> >> interface. In addition I would set the TAINT_USER flag and add a > >> >> warning about loading vfio-pci on an active PF, and provide the number > >> >> of VFs that were allocated. > >> >> > >> >> The idea with all of this being that we would at least have a partial > >> >> lock on all of this so that you can allocate some number of VFs, and > >> >> then start partitioning things up where you could assign the PF and > >> >> VFs as needed to the various interfaces. Once the PF is assigned to > >> >> the vfio-pci driver it would be locked in terms of the number of VFs > >> >> that are provided so there would be no need for an extra communication > >> >> channel between the PF driver and the host to deal with changes. > >> >> > >> >> Thoughts? > >> > > >> > Is an sriov configuration ever static? vfio is the most prolific user > >> > of the pci reset interfaces, AFAIK. Even if we add support for > >> > restoring the sriov configuration and even if the pf user isn't trying > >> > to be malicious, vf users would need to be prepared for their device > >> > arbitrarily dropping off the bus, mmio randomly (from their > >> > perspective) being disabled, and maybe other side-effects of the user > >> > just trying to use the pf device. Is there even any guarantee that a > >> > pf driver can operate identically, regardless of the state of sriov on > >> > the pf? It seems like the pf driver needs to be aware of which queues > >> > are available for itself vs the vfs in some designs. Is there still > >> > long term value in a solution for this if the kernel is tainted? > >> > >> As far as the device randomly resetting I don't really see how that is > >> any different from what we already have in terms of solutions > >> including stuff supported by in-driver SR-IOV. The Intel drivers are > >> notorious for resetting for any little thing like an MTU change. Us > >> resetting due to someone calling open/release on an interface wouldn't > >> be anything new. In addition AER can always hit us too. Yes we don't > >> have the same recovery mechanisms in place, but in the case of > >> something such as this we probably don't need that complex of a > >> recovery setup. > > > > [I think below paragraph is specifically answering last question above, > > still long term value in kernel tainting solution] > > > >> I would think there probably is. If not we wouldn't have gotten the > >> patches earlier that were still doing the tainting and warning and > >> making use of the vfio-pci driver to enable SR-IOV. I suspect the use > >> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and > >> then assign VFs into and out of guests. I don't see the PF doing a lot > >> of moving after it is setup. The taint flag only really applies if > >> someone is looking for support and quite honestly I figure for now the > >> USER flag is appropriate for this kind of thing since we are deferring > >> all the networking control to the PF which is managed by userspace. > > > > If we're going to throw up our hands and taint the kernel, then maybe > > we could entertain the idea of doing that in a trivial vfio-pci > > sriov_configure function. But does that really meet all the use cases > > and what's the advantage of that vs Amazon (I guess they're the driver > > here) tainting the kernel with their own driver? Mellanox has tried to > > enable sriov in vfio-pci in the past, Cc Ilya. > > > >> > Minimally, it seems like the pf driver (the user driver, not vfio-pci) > >> > needs to be a cooperating party in this exchange. How do we determine > >> > that and what QoS guarantees are they providing by agreeing to it? > >> > >> I think the general idea here is that the user has already tested this > >> out and determined for themselves that the user driver does what they > >> want/need. If it didn't they wouldn't be going to all these lengths to > >> set this up. > >> > >> > Otherwise it doesn't seem like sriov provides the sort of static, hard > >> > partitioning of the device that you're asking for. A vf is a piece of > >> > the pf and we've released ownership and control of the pf to a user. > >> > Thanks, > >> > > >> > Alex > >> > >> I am pretty sure that you are describing is true of some, but not for > >> all. I think the Amazon solutions and the virtio solution are doing > >> hard partitioning of the part. I will leave it to those guys to speak > >> for themselves since I don't know anything about the hardware design > >> of those parts. > > > > I think we'd need device specific knowledge and enablement to be able > > to take advantage of any hardware partitioning, otherwise we need to > > assume the pf is privileged, as implemented in other sriov devices. > > What we would probably need to end up doing is coming up with a > whitelist of parts that support SR-IOV via a unprivileged PF. Maybe > some sort of PCI quirk that would add a new feature flag that says we > could do this without tainting the kernel. As-is this kind of thing > would still be very experimental/new for the kernel so to some extent > using the taint flag is also reflecting that since I don't know how > common any use case like this is at this point or how well it has been > tested. I guess I'm not convinced that "supported sriov via an unprivileged pf" is a thing that exists in this universe. It sounds like something a vendor would claim to have because they want this to work regardless of whether it's true or the right thing to do. Even if the pf can only trigger an flr and that momentarily makes the vfs blip out and back, a malicious user can do that in a loop and disrupt any useful application of the vfs. Thanks, Alex
On Thu, Mar 1, 2018 at 3:58 PM, Alex Williamson <alex.williamson@redhat.com> wrote: > On Thu, 1 Mar 2018 14:42:40 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Thu, Mar 1, 2018 at 12:22 PM, Alex Williamson >> <alex.williamson@redhat.com> wrote: >> > On Wed, 28 Feb 2018 16:36:38 -0800 >> > Alexander Duyck <alexander.duyck@gmail.com> wrote: >> > >> >> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson >> >> <alex.williamson@redhat.com> wrote: >> >> > On Wed, 28 Feb 2018 09:49:21 -0800 >> >> > Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >> > >> >> >> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >> >> >> <alexander.duyck@gmail.com> wrote: >> >> >> > On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >> >> >> > <alex.williamson@redhat.com> wrote: >> >> >> >> On Tue, 27 Feb 2018 11:06:54 -0800 >> >> >> >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >> >> >> >> >> >> >>> From: Alexander Duyck <alexander.h.duyck@intel.com> >> >> >> >>> >> >> >> >>> This patch is meant to add support for SR-IOV on devices when the VFs are >> >> >> >>> not managed by the kernel. Examples of recent patches attempting to do this >> >> >> >>> include: >> >> >> >> >> >> >> >> It appears to enable sriov when the _pf_ is not managed by the >> >> >> >> kernel, but by "managed" we mean that either there is no pf driver or >> >> >> >> the pf driver doesn't provide an sriov_configure callback, >> >> >> >> intentionally or otherwise. >> >> >> >> >> >> >> >>> virto - https://patchwork.kernel.org/patch/10241225/ >> >> >> >>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >> >> >> >>> vfio - https://patchwork.kernel.org/patch/10103353/ >> >> >> >>> uio - https://patchwork.kernel.org/patch/9974031/ >> >> >> >> >> >> >> >> So is the goal to get around the issues with enabling sriov on each of >> >> >> >> the above drivers by doing it under the covers or are you really just >> >> >> >> trying to enable sriov for a truly unmanage (no pf driver) case? For >> >> >> >> example, should a driver explicitly not wanting sriov enabled implement >> >> >> >> a dummy sriov_configure function? >> >> >> >> >> >> >> >>> Since this is quickly blowing up into a multi-driver problem it is probably >> >> >> >>> best to implement this solution in one spot. >> >> >> >>> >> >> >> >>> This patch is an attempt to do that. What we do with this patch is provide >> >> >> >>> a generic call to enable SR-IOV in the case that the PF driver is either >> >> >> >>> not present, or the PF driver doesn't support configuring SR-IOV. >> >> >> >>> >> >> >> >>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >> >> >> >>> value is used as the drivers_autoprobe setting of the VFs when they are >> >> >> >>> being managed by an external entity such as userspace or device firmware >> >> >> >>> instead of being managed by the kernel. >> >> >> >> >> >> >> >> Documentation/ABI/testing/sysfs-bus-pci update is missing. >> >> >> > >> >> >> > I can make sure to update that in the next version. >> >> >> > >> >> >> >>> One side effect of this change is that the sriov_drivers_autoprobe and >> >> >> >>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >> >> >> >>> disabled. Attempts to update them when SR-IOV is in use will only update >> >> >> >>> the local value and will not update sriov->autoprobe. >> >> >> >> >> >> >> >> And we expect users to understand when sriov_drivers_autoprobe applies >> >> >> >> vs sriov_unmanaged_autoprobe, even though they're using the same >> >> >> >> interfaces to enable sriov? Are all combinations expected to work, ex. >> >> >> >> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >> >> >> >> only does it seems like there's opportunity to use this incorrectly, I >> >> >> >> think maybe it might be difficult to use correctly. >> >> >> >> >> >> >> >>> I based my patch set originally on the patch by Mark Rustad but there isn't >> >> >> >>> much left after going through and cleaning out the bits that were no longer >> >> >> >>> needed, and after incorporating the feedback from David Miller. >> >> >> >>> >> >> >> >>> I have included the authors of the original 4 patches above in the Cc here. >> >> >> >>> My hope is to get feedback and/or review on if this works for their use >> >> >> >>> cases. >> >> >> >>> >> >> >> >>> Cc: Mark Rustad <mark.d.rustad@intel.com> >> >> >> >>> Cc: Maximilian Heyne <mheyne@amazon.de> >> >> >> >>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >> >> >> >>> Cc: David Woodhouse <dwmw@amazon.co.uk> >> >> >> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> >> >> >>> --- >> >> >> >>> drivers/pci/iov.c | 27 +++++++++++++++++++- >> >> >> >>> drivers/pci/pci-driver.c | 2 + >> >> >> >>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >> >> >> >>> drivers/pci/pci.h | 4 ++- >> >> >> >>> include/linux/pci.h | 1 + >> >> >> >>> 5 files changed, 86 insertions(+), 10 deletions(-) >> >> >> >>> >> >> >> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >> >> >> >>> index 677924ae0350..7b8858bd8d03 100644 >> >> >> >>> --- a/drivers/pci/iov.c >> >> >> >>> +++ b/drivers/pci/iov.c >> >> >> >>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >> >> >> >>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >> >> >> >>> iov->pgsz = pgsz; >> >> >> >>> iov->self = dev; >> >> >> >>> + iov->autoprobe = true; >> >> >> >>> iov->drivers_autoprobe = true; >> >> >> >>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >> >> >> >>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >> >> >> >>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >> >> >> >>> */ >> >> >> >>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >> >> >> >>> { >> >> >> >>> - if (dev->is_physfn) >> >> >> >>> + if (dev->is_physfn) { >> >> >> >>> dev->sriov->drivers_autoprobe = auto_probe; >> >> >> >>> + if (!dev->sriov->num_VFs) >> >> >> >>> + dev->sriov->autoprobe = auto_probe; >> >> >> >> >> >> >> >> Why is dev->sriov->autoprobe set any time other than immediately prior >> >> >> >> to enabling VFs? >> >> >> > >> >> >> > My concern here was drivers that are still floating around with the >> >> >> > old module parameter option for enabling SR-IOV. In the unlikely event >> >> >> > that somebody was to use such a driver I wanted to make certain that >> >> >> > the drivers_autoprobe state was pre-populated. >> >> > >> >> > Good point, but maybe that just means we should be setting it in >> >> > sriov_enable()? >> >> >> >> I suppose we could. I would just have to check and see if we have any >> >> drivers lurking out there that are supporting SR-IOV without >> >> supporting the sysfs approach. As long as that is the case we could >> >> probably put it there. >> >> So in my follow-on I am now taking a slightly different route. I moved >> things over to us setting autoprobe in pci_enable_sriov, and now just >> bypass the calls to it and pci_disable_sriov in favor of setting >> autoprobe and then using sriov_enable or sriov_disable in the >> pci_sriov_configure_unmanged call. >> >> That should get us the same effect as what we were looking for. >> >> >> >> >>> + } >> >> >> >>> } >> >> >> >>> >> >> >> >>> /** >> >> >> >>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >> >> >> >>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >> >> >> >>> >> >> >> >>> /** >> >> >> >>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >> >> >> >>> + * @dev: the PCI device >> >> >> >>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >> >> >> >>> + * >> >> >> >>> + * Used to provide generic enable/disable SR-IOV option for devices >> >> >> >>> + * that do not manage the VFs generated by their driver, or have no >> >> >> >>> + * driver present. >> >> >> >>> + */ >> >> >> >>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >> >> >> >>> +{ >> >> >> >>> + int rc = 0; >> >> >> >>> + >> >> >> >>> + if (!nr_virtfn) >> >> >> >>> + pci_disable_sriov(dev); >> >> >> >>> + else >> >> >> >>> + rc = pci_enable_sriov(dev, nr_virtfn); >> >> >> >>> + >> >> >> >>> + return rc ? rc : nr_virtfn; >> >> >> >>> +} >> >> >> >>> + >> >> >> >>> +/** >> >> >> >>> * pci_num_vf - return number of VFs associated with a PF device_release_driver >> >> >> >>> * @dev: the PCI device >> >> >> >>> * >> >> >> >>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >> >> >> >>> index 3bed6beda051..2cc68dff6130 100644 >> >> >> >>> --- a/drivers/pci/pci-driver.c >> >> >> >>> +++ b/drivers/pci/pci-driver.c >> >> >> >>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >> >> >> >>> #ifdef CONFIG_PCI_IOV >> >> >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> >> >> >>> { >> >> >> >>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >> >> >> >>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >> >> >> >>> } >> >> >> >>> #else >> >> >> >>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >> >> >> >>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >> >> >> >>> index eb6bee8724cc..e701b6dbc267 100644 >> >> >> >>> --- a/drivers/pci/pci-sysfs.c >> >> >> >>> +++ b/drivers/pci/pci-sysfs.c >> >> >> >>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> >> >> >>> struct device_attribute *attr, >> >> >> >>> const char *buf, size_t count) >> >> >> >>> { >> >> >> >>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >> >> >> >>> struct pci_dev *pdev = to_pci_dev(dev); >> >> >> >>> int ret; >> >> >> >>> u16 num_vfs; >> >> >> >>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >> >> >> >>> goto exit; >> >> >> >>> >> >> >> >>> /* is PF driver loaded w/callback */ >> >> >> >>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >> >> >> >>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >> >> >> >>> - ret = -ENOENT; >> >> >> >>> - goto exit; >> >> >> >>> - } >> >> >> >>> + if (pdev->driver && pdev->driver->sriov_configure) >> >> >> >>> + sriov_configure = pdev->driver->sriov_configure; >> >> >> >>> + else >> >> >> >>> + sriov_configure = pci_sriov_configure_unmanaged; >> >> >> >> >> >> >> >> So an unwitting user is now able to enable vfs, independent of the >> >> >> >> pf... the trouble being that they're probably going to expect them to >> >> >> >> work and the more common case is that they won't. For instance, what >> >> >> >> can you do with an igbvf when igb isn't managing the pf? >> >> >> > >> >> >> > Well the VFs wouldn't be able to do anything. Basically they would be >> >> >> > sitting there with no driver loaded on them unless they are assigned >> >> >> > to a guest, or the root user had enabled the unmanaged option. If you >> >> >> > did load a driver on it the VF would sit there with link down unless >> >> >> > either the PF driver is loaded or some user-space entity steps in to >> >> >> > start managing the PF. >> >> >> > >> >> >> > In reality this can already happen as last I recall igb and ixgbe were >> >> >> > already capable of having the PF driver removed when SR-IOV was >> >> >> > enabled and VFs were assigned. Basically the VFs just report link down >> >> >> > and don't do anything. Reloading the PF driver would have it take over >> >> >> > in the case of igb and ixgbe since they were designed to handle that >> >> >> > type of scenario. >> >> > >> >> > I think only ixgbe behaves this way between the two, igb disables sriov >> >> > in its remove function unless we're hung up by that silly >> >> > pci_vfs_assigned() check, which doesn't apply to vfio assignment. >> >> > Regardless, yes, some pf drivers do leave sriov enabled on remove, >> >> > whether that's useful or reasonable is another question. >> >> >> >> Right. We can argue that another day. I was just sighting the igb behavior. >> >> >> >> >> >> Or what happens when vfio-pci owns the pf, sriov is enabled via the >> >> >> >> unmanaged interface, and the pf user driver segfaults and gets killed, >> >> >> >> causing vfio-pci to restore the pf state, including wiping the sriov >> >> >> >> config? >> >> >> > >> >> >> > Wiping the config shouldn't have any effect on the allocated VF pci >> >> >> > devices. It will cause the VFs to act as though they have fallen off >> >> >> > of the bus though and the guests would see a surprise remove type >> >> >> > behavior if I am not mistaken. The MMIO and config accesses would fail >> >> >> > until the SR-IOV configuration is restored. Really this shouldn't be a >> >> >> > problem as long as the SR-IOV is enabled prior to loading the vfio-pci >> >> >> > driver as I would assume it would restore the state an re-enable >> >> >> > SR-IOV. >> >> > >> >> > Nope, sriov does not appear to be part of the device state saved and >> >> > restored around reset, so I think we just end up in an inconsistent >> >> > state. Also, vfio-pci stores a copy of the saved state of a device >> >> > prior to giving the user access to the device and tries to restore that >> >> > copy when released by the user, so anything that's part of the save >> >> > state and modified via side-channels while the device is opened, would >> >> > still be lost. >> >> >> >> Well it is but it isn't. The pci_restore_state will call >> >> pci_restore_iov_state which will just repopulate the data straight out >> >> of the pdev->sriov structure. I think the assumption is we were >> >> already carrying enough information that there wasn't any point in >> >> saving the state since we could just restore it from that structure >> >> anyway. >> > >> > Ah, right, I remember that now. >> > >> >> I've been chasing it down and can verify that much this morning after >> >> testing. A regular hit of the sysfs reset control will not erase the >> >> state. >> > >> > Perhaps long term, but momentarily the vf disappears and that could >> > imply data loss, no? >> >> It could imply data loss, but really it is no worse then accidentally >> unplugging an Ethernet cable, or at least that has been my experience >> with the Intel Ethernet drivers. > > Are NIC vfs a good choice for least common denominator behavior here? > Links dropping are normal for NICs, they have protocols for that. I'm > not sure something like an NMVe or SATA vf would necessarily be as > resilient. Of course, even if we limit the discussion to NIC vfs, how > much work can the vf do if the pf is constantly triggering the link > down? Maybe it's not data loss, but the opportunity for DoS is still > high. Well NIC VFs are what I know. As far as the DoS potential I agree that is pretty high. That is one of the reasons why I am thinking about backing away from doing this as a global thing and switching back to a per driver solution for this. I think with the "sriov_configure_unmanaged" function and the new sysfs attributed for for unmanaged_probe I have half the problem solved. The issue becomes how to appropriately handle those so that I don't have a huge mess of bugs or security issues being introduced. >> >> One issue that I think I found that may be a bug though is if I >> >> load the vfio-pci driver on an interface and unbind it I am seeing the >> >> port state reset. I have it chased down to the idle D3 code. It looks >> >> like going from vfio_pci_probe to vfio_pci_remove is triggering the >> >> equivilent of the pci_pm_reset since it is cycling me through >> >> D0->D3->D0 without restoring state after the fact. I also verified >> >> that setting disable_idle_d3 resolves the issue. Would you have any >> >> complaints about me doing a save_state in the probe call, and a >> >> restore_state in the remove? It seems like that should probably be the >> >> easiest fix. >> > >> > Hmm, perhaps I had assumed that pci_set_power_state() would handle such >> > things, but indeed I do see other drivers calling pci_save_state() >> > prior and pci_restore_state() after. I think we'd need to audit all of >> > vfio-pci's calls to pci_set_power_state(), not just probe and remove. >> >> Is it okay if I leave that to you then? I'm thinking that might be >> getting into things I am not really prepared to tackle. > > Sure, thanks for spotting it. > >> >> >> > In the grand scheme of things how would the situation you describe be >> >> >> > any different than someone using the "reset" sysfs control on the PF >> >> >> > while using SR-IOV with driver supported SR-IOV? >> >> > >> >> > Similar, but the pf driver has enabled sriov and can register the >> >> > reset_done error handler to be notified to re-enable sriov or perform >> >> > other cleanup. If the pf driver is not participating in sriov, it >> >> > would seem exceptional to support this. Triggering reset via sysfs >> >> > also falls more closely to shooting yourself in the foot than I think >> >> > we want to design into driver/subsystem APIs. >> >> >> >> To some extent I agree with the shooting yourself in the foot. At the >> >> same time there isn't actually all that much to re-enable SR-IOV. >> >> Restoring the PCI SR-IOV configuration space and re-enabling the bus >> >> master enable. >> > >> > Bus master is a pretty interesting example of the trivial control a pf >> > driver can exert on the vfs though. >> >> I'm actually not sure if it matters or not. I know the VFs have their >> own bus-master enable bits and I may be thinking of more of an >> implementation specific thing rather than something that is really >> spec compliant. >> >> >> The Amazon guys would probably know better than I since I haven't >> >> really worked much with one of these parts yet. Actually the virtio >> >> that Mark pushed may behave the same way too. As far as I know in >> >> these firmware cases the hardware itself has everything >> >> pre-partitioned and set to re-enable as soon as the SR-IOV bits are >> >> set. I think all they need is a few bits toggled and they are pretty >> >> much good to go. >> > >> > Are they just looking for an sriov capable stub driver? With >> > increasing vf counts, being able to use something like vfio-pci on the >> > pf seems like all risk with statistically insignificant increase in >> > density. On the other hand, if there's a userspace pf management >> > driver, why not just make it trusted by adding it as a native host >> > kernel driver? If we're talking about tainting the host kernel to >> > enable this interaction, maybe it should just be tainted by an out of >> > tree, possibly non-gpl host pf driver anyway. There can't really be a >> > case where the pf would be used by an average user without some degree >> > of privilege or cooperation, right? >> >> It wouldn't be much of a stretch to think that something like that >> might be possible. I seem to remember when we first started SR-IOV >> somebody had a driver that was doing essentially that. Basically what >> you end up with is function 0 carrying the SR-IOV configuration, and >> any VFs either pop in or out of existence based on that, but the MMIO >> for the device may not have any visibility into anything beyond the >> resources it was pre-allocated. >> >> >> >> > I suppose if you really wanted we could add a new call that you could >> >> >> > put into the sriov_configure pointer that would just make it always >> >> >> > return error. Then that way the configuration could be locked until >> >> >> > the driver is unloaded. >> >> >> > >> >> >> >> I guess I don't understand how vfs can operate fully independent of the >> >> >> >> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to >> >> >> >> avoid contending with such issues. >> >> >> > >> >> >> > This approach isn't without risk, but it isn't as if it is really a >> >> >> > new risk we are introducing, and we could do further work to help >> >> >> > smooth out issues like this. Much of this depends on the design of the >> >> >> > device and the drivers involved. What is happening in this case is >> >> >> > that the VFs are managed outside of the kernel itself either by some >> >> >> > user-space entity or by a firmware entity. >> >> > >> >> > But we can't simply split vfs and pfs as separate resources, the vf is >> >> > dependent on the pf and that implies a privilege hierarchy, if not at >> >> > least cooperation. The pf driver can intentionally or unintentionally >> >> > disconnect the vfs at any point in time, possibly leading to data loss >> >> > or at least denial of service for the vf consumer. I also don't trust >> >> > that there aren't numerous sriov implementations where the pf isn't >> >> > privileged to the point of being able to snoop data from the vf. So >> >> > what sort of usage models are we enabling for the vf? A firmware owned >> >> > vf subject to these conditions seems mostly pointless. >> >> >> >> In the case of these sort of parts the PF isn't really all that >> >> privileged. Usually the PF is just a VF with the SR-IOV capability >> >> hanging off of it. I suspect the only thing that might outright >> >> control anything would be the bus master enable bit. Everything else >> >> could be independent. The PF driver in such cases doesn't do much. It >> >> is basically just the host for the configuration space. >> > >> > This is sounding more like an sriov capable stub driver. Certainly not >> > all pfs behave the way you describe above, many are significantly more >> > privileged and even if not, bus master is a pretty trivial control >> > point. So we probably need a driver that claims devices known to >> > behave this way, or a meta driver that bind via dynamic IDs or >> > driver_override. The proposal here sort of covertly turns anything >> > that's not an sriov driver into that stub driver with no guarantee that >> > the subverted driver is a willing or safe participant. >> >> Yeah, I'm starting to think that way as well. The problem of leaving >> VFs floating around as enabled is starting to be problematic and would >> require much more of a refactor then I think we want here. >> >> >> >> So I was thinking about this some more. In the case of vfio-pci things >> >> >> are a bit different since you are essentially taking a given device >> >> >> and handing it to a VM or userspace and it doesn't guarantee a >> >> >> communication between the two. >> >> > >> >> > Doesn't guarantee communication or cooperation or even trust. >> >> >> >> Right, but at the same time I consider this to be a shoot yourself in >> >> the foot type scenario. If you are going to hand off your PF while VFs >> >> are active then you are asking for whatever repercussions you end up >> >> getting. I've added a warning and a TAINT_USER flag to my code at this >> >> point if you load an active PF in vfio, and I have added a function >> >> that locks the setting so it cannot be changed once we place a PF in >> >> the control of vfio-pci. >> >> >> >> The way I see it there are two scenarios. One where the PF is just a >> >> VF with an extra bit of SR-IOV configuration space floating off of it. >> >> The other is where we want to have SR-IOV enabled and have some third >> >> party managing the PF. The first one doesn't really care about the >> >> communication and would prefer that whatever owns the driver on the PF >> >> just ignore the SR-IOV portion of the configuration space. The second >> >> one actually cares and would want some sort of >> >> communication/cooperation but for now I don't see that as being the >> >> easiest thing to do so it might be best to just let it see a fixed >> >> number of VFs it just has to work with that. >> > >> > There's no such thing as a read-only sriov capability in the spec, >> > which is a problem we ran into a while ago, vfio-pci exposes the >> > capability, but protects it as read-only so the user cannot create >> > devices on the host. QEMU passed this through to the guest, but that >> > failed as soon as OVMF started supporting sriov as it's unable to size >> > the VF BAR resources. Now QEMU drops the sriov capability from the >> > guest capability chain. So it's not clear how this fixed number of vfs >> > plan works. Are we inventing our own capability for this? If the pf >> > driver is just a userspace driver and not a VM, maybe what we do now is >> > sufficient, otherwise there's no standard for exposing fixed vfs. >> >> So one thought I had is what if we provide an ioctl command that >> allows for setting the number of VFs through VFIO? Basically it >> wouldn't be exposed through the sysfs, but would be totally controlled >> by the userspace using the sriov_configure_unmanged call to >> enable/disable the VFs and ultimately cleaning it up when the device >> it is released. I would imagine all the Amazon guys were looking for >> is something where they could just have some sort of command that >> could reach through and manage their VF count from some daemon. If >> they assigned the interface to a guest it would not be able to touch >> the SR-IOV capability unless QEMU went and implemented it which likely >> isn't happening anytime soon. It would provide the userspace guys a >> way to allocate the VF resources, call this ioctl, and then start >> managing VFs in the case of a userspace managed entity instead of it >> just being a firmware or preallocated type SR-IOV configuration. > > Wait a sec, we were considering that a user owned pf sourcing vfs would > taint the host kernel and now we want to give the user the ability to > enable those vfs and trigger that tainting? I saw the route through > sysfs to enable sriov on a user owned device as a feature because it > would require an existing privilege path rather than creating one > through vfio. For example, an unprivileged QEMU could submit a > request to a privileged entity to enable sriov through sysfs. If we > wanted to allow the user to enable sriov directly, the user would at > least need to run with privileges like CAP_SYS_ADMIN. The issue in all this ends up being how to handle the synchronization between userspace and the kernel. That was the only reason why I was thinking of that. An alternative thought I had was to look at just making it so that we did a minimal vfio version of sriov_configure that only allowed changing the number of VFs when the device was not already in use, specifically if the vdev->refcnt was non-zero. So the usage in such a case would be to bind the driver, set the number of VFs, and then start the application. On driver unload we would then call pci_disable_sriov and just clean everything up on the way out. > Side note: we could just enable emulation of the sriov capability > through config space in vfio-pci if we wanted to allow that, rather > than adding a new ioctl. That seems like a really messy way of doing things though since there isn't really any way to report exceptions other than just refusing to set the bit and for all we know that could be the result of a hardware error. >> >> >> My thought is to look at making SR-IOV configuration static or treat >> >> >> it as read-only when the vfio-pci driver is loaded on a given >> >> >> interface. In addition I would set the TAINT_USER flag and add a >> >> >> warning about loading vfio-pci on an active PF, and provide the number >> >> >> of VFs that were allocated. >> >> >> >> >> >> The idea with all of this being that we would at least have a partial >> >> >> lock on all of this so that you can allocate some number of VFs, and >> >> >> then start partitioning things up where you could assign the PF and >> >> >> VFs as needed to the various interfaces. Once the PF is assigned to >> >> >> the vfio-pci driver it would be locked in terms of the number of VFs >> >> >> that are provided so there would be no need for an extra communication >> >> >> channel between the PF driver and the host to deal with changes. >> >> >> >> >> >> Thoughts? >> >> > >> >> > Is an sriov configuration ever static? vfio is the most prolific user >> >> > of the pci reset interfaces, AFAIK. Even if we add support for >> >> > restoring the sriov configuration and even if the pf user isn't trying >> >> > to be malicious, vf users would need to be prepared for their device >> >> > arbitrarily dropping off the bus, mmio randomly (from their >> >> > perspective) being disabled, and maybe other side-effects of the user >> >> > just trying to use the pf device. Is there even any guarantee that a >> >> > pf driver can operate identically, regardless of the state of sriov on >> >> > the pf? It seems like the pf driver needs to be aware of which queues >> >> > are available for itself vs the vfs in some designs. Is there still >> >> > long term value in a solution for this if the kernel is tainted? >> >> >> >> As far as the device randomly resetting I don't really see how that is >> >> any different from what we already have in terms of solutions >> >> including stuff supported by in-driver SR-IOV. The Intel drivers are >> >> notorious for resetting for any little thing like an MTU change. Us >> >> resetting due to someone calling open/release on an interface wouldn't >> >> be anything new. In addition AER can always hit us too. Yes we don't >> >> have the same recovery mechanisms in place, but in the case of >> >> something such as this we probably don't need that complex of a >> >> recovery setup. >> > >> > [I think below paragraph is specifically answering last question above, >> > still long term value in kernel tainting solution] >> > >> >> I would think there probably is. If not we wouldn't have gotten the >> >> patches earlier that were still doing the tainting and warning and >> >> making use of the vfio-pci driver to enable SR-IOV. I suspect the use >> >> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and >> >> then assign VFs into and out of guests. I don't see the PF doing a lot >> >> of moving after it is setup. The taint flag only really applies if >> >> someone is looking for support and quite honestly I figure for now the >> >> USER flag is appropriate for this kind of thing since we are deferring >> >> all the networking control to the PF which is managed by userspace. >> > >> > If we're going to throw up our hands and taint the kernel, then maybe >> > we could entertain the idea of doing that in a trivial vfio-pci >> > sriov_configure function. But does that really meet all the use cases >> > and what's the advantage of that vs Amazon (I guess they're the driver >> > here) tainting the kernel with their own driver? Mellanox has tried to >> > enable sriov in vfio-pci in the past, Cc Ilya. >> > >> >> > Minimally, it seems like the pf driver (the user driver, not vfio-pci) >> >> > needs to be a cooperating party in this exchange. How do we determine >> >> > that and what QoS guarantees are they providing by agreeing to it? >> >> >> >> I think the general idea here is that the user has already tested this >> >> out and determined for themselves that the user driver does what they >> >> want/need. If it didn't they wouldn't be going to all these lengths to >> >> set this up. >> >> >> >> > Otherwise it doesn't seem like sriov provides the sort of static, hard >> >> > partitioning of the device that you're asking for. A vf is a piece of >> >> > the pf and we've released ownership and control of the pf to a user. >> >> > Thanks, >> >> > >> >> > Alex >> >> >> >> I am pretty sure that you are describing is true of some, but not for >> >> all. I think the Amazon solutions and the virtio solution are doing >> >> hard partitioning of the part. I will leave it to those guys to speak >> >> for themselves since I don't know anything about the hardware design >> >> of those parts. >> > >> > I think we'd need device specific knowledge and enablement to be able >> > to take advantage of any hardware partitioning, otherwise we need to >> > assume the pf is privileged, as implemented in other sriov devices. >> >> What we would probably need to end up doing is coming up with a >> whitelist of parts that support SR-IOV via a unprivileged PF. Maybe >> some sort of PCI quirk that would add a new feature flag that says we >> could do this without tainting the kernel. As-is this kind of thing >> would still be very experimental/new for the kernel so to some extent >> using the taint flag is also reflecting that since I don't know how >> common any use case like this is at this point or how well it has been >> tested. > > I guess I'm not convinced that "supported sriov via an unprivileged pf" > is a thing that exists in this universe. It sounds like something a > vendor would claim to have because they want this to work regardless of > whether it's true or the right thing to do. Even if the pf can only > trigger an flr and that momentarily makes the vfs blip out and back, a > malicious user can do that in a loop and disrupt any useful application > of the vfs. Thanks, > > Alex I'm not saying the PF is unprivileged, it just doesn't have all that much privilege in these cases. I think you have sighted most of the big "privileges" that should impact things, specifically the function resets and the SR-IOV capability. But as far as actually doing anything like handling global resources there is a firmware engine that is usually handling things like link, queues, interrupts, and such. Anyway that is the firmware case which is only half the problem. The other side of this is the userspace/SDN guys who are essentially trying to turn the VMs into network infrastructure and that is where requests like the userspace PF comes from so that they can get something like DPDK running on it so that it can act as some piece of NFV infrastructure such as a firewall or accelerated OVS switch. Usually the case is if the PF goes down the VFs are going down one way or another since you aren't getting traffic through either a disabled firewall or switch.
> From: Alex Williamson > Sent: Friday, March 2, 2018 4:22 AM > > > > I am pretty sure that you are describing is true of some, but not for > > all. I think the Amazon solutions and the virtio solution are doing > > hard partitioning of the part. I will leave it to those guys to speak > > for themselves since I don't know anything about the hardware design > > of those parts. > > I think we'd need device specific knowledge and enablement to be able > to take advantage of any hardware partitioning, otherwise we need to > assume the pf is privileged, as implemented in other sriov devices. > > I'm also trying to imagine whether there's a solution via the new > vfio/mdev interface, where the mdev vendor driver would bind to the pf > and effectively present itself as the mdev device. The vendor driver > could provide sriov capabilities and bear the burden of ensuring that > the pf is used cooperatively. The only existing mdev vendor drivers are > vGPUs and rely on on-device DMA translation and isolation, such as > through GTTs, but there have been some thoughts on providing IOMMU > based > isolation of mdev/sriov mixed devices (assuming DMA is even required > for userspace management of the pf in this use case). [Cc Kirti] > Thanks, > Hope not distracting this thread, but above sounds like an interesting idea. Actually we ever brainstormed similar thought for another potential usage - supporting VF live migration. We are already working on an generic extension to allow state save/restore of mdev instance. If vendor driver could further wrap pf as a mdev instance, it could leverage the same framework for a clean state migration on VF. based on mmap callback the vendor driver can easily switch back-and-forth between pass through and trap/emulation of the VF resources. Of course doing so alone doesn't address all the demands of VF live migration (e.g. dirty page tracking still requires other techniques), but it does pave a way toward a general framework to support VF live migration. If above is feasible, finally we could use one mdev framework to manage both mdev and pf/vf assignment, while providing added values which are difficult to achieve today. :-) Thanks Kevin
> From: Tian, Kevin > Sent: Friday, March 2, 2018 2:54 PM > > > From: Alex Williamson > > Sent: Friday, March 2, 2018 4:22 AM > > > > > > I am pretty sure that you are describing is true of some, but not for > > > all. I think the Amazon solutions and the virtio solution are doing > > > hard partitioning of the part. I will leave it to those guys to speak > > > for themselves since I don't know anything about the hardware design > > > of those parts. > > > > I think we'd need device specific knowledge and enablement to be able > > to take advantage of any hardware partitioning, otherwise we need to > > assume the pf is privileged, as implemented in other sriov devices. > > > > I'm also trying to imagine whether there's a solution via the new > > vfio/mdev interface, where the mdev vendor driver would bind to the pf > > and effectively present itself as the mdev device. The vendor driver > > could provide sriov capabilities and bear the burden of ensuring that > > the pf is used cooperatively. The only existing mdev vendor drivers are > > vGPUs and rely on on-device DMA translation and isolation, such as > > through GTTs, but there have been some thoughts on providing IOMMU > > based > > isolation of mdev/sriov mixed devices (assuming DMA is even required > > for userspace management of the pf in this use case). [Cc Kirti] > > Thanks, > > > > Hope not distracting this thread, but above sounds like an interesting > idea. Actually we ever brainstormed similar thought for another > potential usage - supporting VF live migration. We are already working > on an generic extension to allow state save/restore of mdev instance. > If vendor driver could further wrap pf as a mdev instance, it could I meant "wrap vf as a mdev instance" here. > leverage the same framework for a clean state migration on VF. based > on mmap callback the vendor driver can easily switch back-and-forth > between pass through and trap/emulation of the VF resources. Of > course doing so alone doesn't address all the demands of VF live > migration (e.g. dirty page tracking still requires other techniques), > but it does pave a way toward a general framework to support VF > live migration. > > If above is feasible, finally we could use one mdev framework to > manage both mdev and pf/vf assignment, while providing added > values which are difficult to achieve today. :-) > > Thanks > Kevin
On Thu, 1 Mar 2018 18:49:53 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Thu, Mar 1, 2018 at 3:58 PM, Alex Williamson > <alex.williamson@redhat.com> wrote: > > On Thu, 1 Mar 2018 14:42:40 -0800 > > Alexander Duyck <alexander.duyck@gmail.com> wrote: > > > >> On Thu, Mar 1, 2018 at 12:22 PM, Alex Williamson > >> <alex.williamson@redhat.com> wrote: > >> > On Wed, 28 Feb 2018 16:36:38 -0800 > >> > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> > > >> >> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson > >> >> <alex.williamson@redhat.com> wrote: > >> >> > On Wed, 28 Feb 2018 09:49:21 -0800 > >> >> > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> >> > > >> >> >> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck > >> >> >> <alexander.duyck@gmail.com> wrote: > >> >> >> So I was thinking about this some more. In the case of vfio-pci things > >> >> >> are a bit different since you are essentially taking a given device > >> >> >> and handing it to a VM or userspace and it doesn't guarantee a > >> >> >> communication between the two. > >> >> > > >> >> > Doesn't guarantee communication or cooperation or even trust. > >> >> > >> >> Right, but at the same time I consider this to be a shoot yourself in > >> >> the foot type scenario. If you are going to hand off your PF while VFs > >> >> are active then you are asking for whatever repercussions you end up > >> >> getting. I've added a warning and a TAINT_USER flag to my code at this > >> >> point if you load an active PF in vfio, and I have added a function > >> >> that locks the setting so it cannot be changed once we place a PF in > >> >> the control of vfio-pci. > >> >> > >> >> The way I see it there are two scenarios. One where the PF is just a > >> >> VF with an extra bit of SR-IOV configuration space floating off of it. > >> >> The other is where we want to have SR-IOV enabled and have some third > >> >> party managing the PF. The first one doesn't really care about the > >> >> communication and would prefer that whatever owns the driver on the PF > >> >> just ignore the SR-IOV portion of the configuration space. The second > >> >> one actually cares and would want some sort of > >> >> communication/cooperation but for now I don't see that as being the > >> >> easiest thing to do so it might be best to just let it see a fixed > >> >> number of VFs it just has to work with that. > >> > > >> > There's no such thing as a read-only sriov capability in the spec, > >> > which is a problem we ran into a while ago, vfio-pci exposes the > >> > capability, but protects it as read-only so the user cannot create > >> > devices on the host. QEMU passed this through to the guest, but that > >> > failed as soon as OVMF started supporting sriov as it's unable to size > >> > the VF BAR resources. Now QEMU drops the sriov capability from the > >> > guest capability chain. So it's not clear how this fixed number of vfs > >> > plan works. Are we inventing our own capability for this? If the pf > >> > driver is just a userspace driver and not a VM, maybe what we do now is > >> > sufficient, otherwise there's no standard for exposing fixed vfs. > >> > >> So one thought I had is what if we provide an ioctl command that > >> allows for setting the number of VFs through VFIO? Basically it > >> wouldn't be exposed through the sysfs, but would be totally controlled > >> by the userspace using the sriov_configure_unmanged call to > >> enable/disable the VFs and ultimately cleaning it up when the device > >> it is released. I would imagine all the Amazon guys were looking for > >> is something where they could just have some sort of command that > >> could reach through and manage their VF count from some daemon. If > >> they assigned the interface to a guest it would not be able to touch > >> the SR-IOV capability unless QEMU went and implemented it which likely > >> isn't happening anytime soon. It would provide the userspace guys a > >> way to allocate the VF resources, call this ioctl, and then start > >> managing VFs in the case of a userspace managed entity instead of it > >> just being a firmware or preallocated type SR-IOV configuration. > > > > Wait a sec, we were considering that a user owned pf sourcing vfs would > > taint the host kernel and now we want to give the user the ability to > > enable those vfs and trigger that tainting? I saw the route through > > sysfs to enable sriov on a user owned device as a feature because it > > would require an existing privilege path rather than creating one > > through vfio. For example, an unprivileged QEMU could submit a > > request to a privileged entity to enable sriov through sysfs. If we > > wanted to allow the user to enable sriov directly, the user would at > > least need to run with privileges like CAP_SYS_ADMIN. > > The issue in all this ends up being how to handle the synchronization > between userspace and the kernel. That was the only reason why I was > thinking of that. > > An alternative thought I had was to look at just making it so that we > did a minimal vfio version of sriov_configure that only allowed > changing the number of VFs when the device was not already in use, > specifically if the vdev->refcnt was non-zero. So the usage in such a > case would be to bind the driver, set the number of VFs, and then > start the application. On driver unload we would then call > pci_disable_sriov and just clean everything up on the way out. Yep, aside from asking Amazon to submit a native host driver, this is the only option I'm seeing as well. Of course vfio would need to taint the kernel since it's not doing anything to restrict the usage of those potentially compromised vfs, and then we need to think about how much code we're willing to introduce and maintain to enable unsupported use cases. > > Side note: we could just enable emulation of the sriov capability > > through config space in vfio-pci if we wanted to allow that, rather > > than adding a new ioctl. > > That seems like a really messy way of doing things though since there > isn't really any way to report exceptions other than just refusing to > set the bit and for all we know that could be the result of a hardware > error. Re-reading a register doesn't seem so terrible. Of course, unless we can solve the tainting problem, it doesn't matter. Either an ioctl or emulation is more code than I'm willing to maintain for a taint-only interface. Thanks, Alex
On Fri, 2 Mar 2018 06:54:17 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson > > Sent: Friday, March 2, 2018 4:22 AM > > > > > > I am pretty sure that you are describing is true of some, but not for > > > all. I think the Amazon solutions and the virtio solution are doing > > > hard partitioning of the part. I will leave it to those guys to speak > > > for themselves since I don't know anything about the hardware design > > > of those parts. > > > > I think we'd need device specific knowledge and enablement to be able > > to take advantage of any hardware partitioning, otherwise we need to > > assume the pf is privileged, as implemented in other sriov devices. > > > > I'm also trying to imagine whether there's a solution via the new > > vfio/mdev interface, where the mdev vendor driver would bind to the pf > > and effectively present itself as the mdev device. The vendor driver > > could provide sriov capabilities and bear the burden of ensuring that > > the pf is used cooperatively. The only existing mdev vendor drivers are > > vGPUs and rely on on-device DMA translation and isolation, such as > > through GTTs, but there have been some thoughts on providing IOMMU > > based > > isolation of mdev/sriov mixed devices (assuming DMA is even required > > for userspace management of the pf in this use case). [Cc Kirti] > > Thanks, > > > > Hope not distracting this thread, but above sounds like an interesting > idea. Actually we ever brainstormed similar thought for another > potential usage - supporting VF live migration. We are already working > on an generic extension to allow state save/restore of mdev instance. > If vendor driver could further wrap pf as a mdev instance, it could > leverage the same framework for a clean state migration on VF. based > on mmap callback the vendor driver can easily switch back-and-forth > between pass through and trap/emulation of the VF resources. Of > course doing so alone doesn't address all the demands of VF live > migration (e.g. dirty page tracking still requires other techniques), > but it does pave a way toward a general framework to support VF > live migration. > > If above is feasible, finally we could use one mdev framework to > manage both mdev and pf/vf assignment, while providing added > values which are difficult to achieve today. :-) mdev drivers may be the first to support migration, but I wonder if a full mdev implementation is necessary for it. Once the vfio api is define, device specific extensions to vfio-pci might be able to implement migration more directly. Thanks, Alex
> From: Alex Williamson [mailto:alex.williamson@redhat.com] > Sent: Saturday, March 3, 2018 2:14 AM > > On Fri, 2 Mar 2018 06:54:17 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson > > > Sent: Friday, March 2, 2018 4:22 AM > > > > > > > > I am pretty sure that you are describing is true of some, but not for > > > > all. I think the Amazon solutions and the virtio solution are doing > > > > hard partitioning of the part. I will leave it to those guys to speak > > > > for themselves since I don't know anything about the hardware design > > > > of those parts. > > > > > > I think we'd need device specific knowledge and enablement to be able > > > to take advantage of any hardware partitioning, otherwise we need to > > > assume the pf is privileged, as implemented in other sriov devices. > > > > > > I'm also trying to imagine whether there's a solution via the new > > > vfio/mdev interface, where the mdev vendor driver would bind to the > pf > > > and effectively present itself as the mdev device. The vendor driver > > > could provide sriov capabilities and bear the burden of ensuring that > > > the pf is used cooperatively. The only existing mdev vendor drivers are > > > vGPUs and rely on on-device DMA translation and isolation, such as > > > through GTTs, but there have been some thoughts on providing IOMMU > > > based > > > isolation of mdev/sriov mixed devices (assuming DMA is even required > > > for userspace management of the pf in this use case). [Cc Kirti] > > > Thanks, > > > > > > > Hope not distracting this thread, but above sounds like an interesting > > idea. Actually we ever brainstormed similar thought for another > > potential usage - supporting VF live migration. We are already working > > on an generic extension to allow state save/restore of mdev instance. > > If vendor driver could further wrap pf as a mdev instance, it could > > leverage the same framework for a clean state migration on VF. based > > on mmap callback the vendor driver can easily switch back-and-forth > > between pass through and trap/emulation of the VF resources. Of > > course doing so alone doesn't address all the demands of VF live > > migration (e.g. dirty page tracking still requires other techniques), > > but it does pave a way toward a general framework to support VF > > live migration. > > > > If above is feasible, finally we could use one mdev framework to > > manage both mdev and pf/vf assignment, while providing added > > values which are difficult to achieve today. :-) > > mdev drivers may be the first to support migration, but I wonder if a > full mdev implementation is necessary for it. Once the vfio api is > define, device specific extensions to vfio-pci might be able to > implement migration more directly. Thanks, > > Alex yes technically a full mdev implementation is not necessary. If device specific extensions will be placed within vfio module, it's obviously straightforward. What I thought earlier was in case vfio wants to stay device-agnostic then we probably want device specific extensions in vendor driver which is loaded but in a dummy mode which simply do basic PCI initialization as vfio-pci and then wrap vf as mdev (since vfio-pci is not the vf driver in this scenario). It's especially useful for vendor drivers which aim to support both mdev and sr-iov by sharing common state mgmt. knowledge, but looks an overkill to other vendor drivers. Possibly finally we'll allow both - simple device extensions in vfio-pci and complex device extensions in vendor drivers through vfio-mdev. Thanks Kevin
On 03/01/2018 03:22 PM, Alex Williamson wrote: > On Wed, 28 Feb 2018 16:36:38 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson >> <alex.williamson@redhat.com> wrote: >>> On Wed, 28 Feb 2018 09:49:21 -0800 >>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>> >>>> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >>>> <alexander.duyck@gmail.com> wrote: >>>>> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >>>>> <alex.williamson@redhat.com> wrote: >>>>>> On Tue, 27 Feb 2018 11:06:54 -0800 >>>>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>>>> >>>>>>> From: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>> >>>>>>> This patch is meant to add support for SR-IOV on devices when the VFs are >>>>>>> not managed by the kernel. Examples of recent patches attempting to do this >>>>>>> include: >>>>>> >>>>>> It appears to enable sriov when the _pf_ is not managed by the >>>>>> kernel, but by "managed" we mean that either there is no pf driver or >>>>>> the pf driver doesn't provide an sriov_configure callback, >>>>>> intentionally or otherwise. >>>>>> >>>>>>> virto - https://patchwork.kernel.org/patch/10241225/ >>>>>>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >>>>>>> vfio - https://patchwork.kernel.org/patch/10103353/ >>>>>>> uio - https://patchwork.kernel.org/patch/9974031/ >>>>>> >>>>>> So is the goal to get around the issues with enabling sriov on each of >>>>>> the above drivers by doing it under the covers or are you really just >>>>>> trying to enable sriov for a truly unmanage (no pf driver) case? For >>>>>> example, should a driver explicitly not wanting sriov enabled implement >>>>>> a dummy sriov_configure function? >>>>>> >>>>>>> Since this is quickly blowing up into a multi-driver problem it is probably >>>>>>> best to implement this solution in one spot. >>>>>>> >>>>>>> This patch is an attempt to do that. What we do with this patch is provide >>>>>>> a generic call to enable SR-IOV in the case that the PF driver is either >>>>>>> not present, or the PF driver doesn't support configuring SR-IOV. >>>>>>> >>>>>>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This >>>>>>> value is used as the drivers_autoprobe setting of the VFs when they are >>>>>>> being managed by an external entity such as userspace or device firmware >>>>>>> instead of being managed by the kernel. >>>>>> >>>>>> Documentation/ABI/testing/sysfs-bus-pci update is missing. >>>>> >>>>> I can make sure to update that in the next version. >>>>> >>>>>>> One side effect of this change is that the sriov_drivers_autoprobe and >>>>>>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is >>>>>>> disabled. Attempts to update them when SR-IOV is in use will only update >>>>>>> the local value and will not update sriov->autoprobe. >>>>>> >>>>>> And we expect users to understand when sriov_drivers_autoprobe applies >>>>>> vs sriov_unmanaged_autoprobe, even though they're using the same >>>>>> interfaces to enable sriov? Are all combinations expected to work, ex. >>>>>> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >>>>>> only does it seems like there's opportunity to use this incorrectly, I >>>>>> think maybe it might be difficult to use correctly. >>>>>> >>>>>>> I based my patch set originally on the patch by Mark Rustad but there isn't >>>>>>> much left after going through and cleaning out the bits that were no longer >>>>>>> needed, and after incorporating the feedback from David Miller. >>>>>>> >>>>>>> I have included the authors of the original 4 patches above in the Cc here. >>>>>>> My hope is to get feedback and/or review on if this works for their use >>>>>>> cases. >>>>>>> >>>>>>> Cc: Mark Rustad <mark.d.rustad@intel.com> >>>>>>> Cc: Maximilian Heyne <mheyne@amazon.de> >>>>>>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >>>>>>> Cc: David Woodhouse <dwmw@amazon.co.uk> >>>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>> --- >>>>>>> drivers/pci/iov.c | 27 +++++++++++++++++++- >>>>>>> drivers/pci/pci-driver.c | 2 + >>>>>>> drivers/pci/pci-sysfs.c | 62 +++++++++++++++++++++++++++++++++++++++++----- >>>>>>> drivers/pci/pci.h | 4 ++- >>>>>>> include/linux/pci.h | 1 + >>>>>>> 5 files changed, 86 insertions(+), 10 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>>>>>> index 677924ae0350..7b8858bd8d03 100644 >>>>>>> --- a/drivers/pci/iov.c >>>>>>> +++ b/drivers/pci/iov.c >>>>>>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) >>>>>>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); >>>>>>> iov->pgsz = pgsz; >>>>>>> iov->self = dev; >>>>>>> + iov->autoprobe = true; >>>>>>> iov->drivers_autoprobe = true; >>>>>>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >>>>>>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); >>>>>>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >>>>>>> */ >>>>>>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) >>>>>>> { >>>>>>> - if (dev->is_physfn) >>>>>>> + if (dev->is_physfn) { >>>>>>> dev->sriov->drivers_autoprobe = auto_probe; >>>>>>> + if (!dev->sriov->num_VFs) >>>>>>> + dev->sriov->autoprobe = auto_probe; >>>>>> >>>>>> Why is dev->sriov->autoprobe set any time other than immediately prior >>>>>> to enabling VFs? >>>>> >>>>> My concern here was drivers that are still floating around with the >>>>> old module parameter option for enabling SR-IOV. In the unlikely event >>>>> that somebody was to use such a driver I wanted to make certain that >>>>> the drivers_autoprobe state was pre-populated. >>> >>> Good point, but maybe that just means we should be setting it in >>> sriov_enable()? >> >> I suppose we could. I would just have to check and see if we have any >> drivers lurking out there that are supporting SR-IOV without >> supporting the sysfs approach. As long as that is the case we could >> probably put it there. >> >>>>>>> + } >>>>>>> } >>>>>>> >>>>>>> /** >>>>>>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >>>>>>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >>>>>>> >>>>>>> /** >>>>>>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV >>>>>>> + * @dev: the PCI device >>>>>>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >>>>>>> + * >>>>>>> + * Used to provide generic enable/disable SR-IOV option for devices >>>>>>> + * that do not manage the VFs generated by their driver, or have no >>>>>>> + * driver present. >>>>>>> + */ >>>>>>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) >>>>>>> +{ >>>>>>> + int rc = 0; >>>>>>> + >>>>>>> + if (!nr_virtfn) >>>>>>> + pci_disable_sriov(dev); >>>>>>> + else >>>>>>> + rc = pci_enable_sriov(dev, nr_virtfn); >>>>>>> + >>>>>>> + return rc ? rc : nr_virtfn; >>>>>>> +} >>>>>>> + >>>>>>> +/** >>>>>>> * pci_num_vf - return number of VFs associated with a PF device_release_driver >>>>>>> * @dev: the PCI device >>>>>>> * >>>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >>>>>>> index 3bed6beda051..2cc68dff6130 100644 >>>>>>> --- a/drivers/pci/pci-driver.c >>>>>>> +++ b/drivers/pci/pci-driver.c >>>>>>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) >>>>>>> #ifdef CONFIG_PCI_IOV >>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>> { >>>>>>> - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); >>>>>>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >>>>>>> } >>>>>>> #else >>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >>>>>>> index eb6bee8724cc..e701b6dbc267 100644 >>>>>>> --- a/drivers/pci/pci-sysfs.c >>>>>>> +++ b/drivers/pci/pci-sysfs.c >>>>>>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, >>>>>>> struct device_attribute *attr, >>>>>>> const char *buf, size_t count) >>>>>>> { >>>>>>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >>>>>>> struct pci_dev *pdev = to_pci_dev(dev); >>>>>>> int ret; >>>>>>> u16 num_vfs; >>>>>>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, >>>>>>> goto exit; >>>>>>> >>>>>>> /* is PF driver loaded w/callback */ >>>>>>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >>>>>>> - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); >>>>>>> - ret = -ENOENT; >>>>>>> - goto exit; >>>>>>> - } >>>>>>> + if (pdev->driver && pdev->driver->sriov_configure) >>>>>>> + sriov_configure = pdev->driver->sriov_configure; >>>>>>> + else >>>>>>> + sriov_configure = pci_sriov_configure_unmanaged; >>>>>> >>>>>> So an unwitting user is now able to enable vfs, independent of the >>>>>> pf... the trouble being that they're probably going to expect them to >>>>>> work and the more common case is that they won't. For instance, what >>>>>> can you do with an igbvf when igb isn't managing the pf? >>>>> >>>>> Well the VFs wouldn't be able to do anything. Basically they would be >>>>> sitting there with no driver loaded on them unless they are assigned >>>>> to a guest, or the root user had enabled the unmanaged option. If you >>>>> did load a driver on it the VF would sit there with link down unless >>>>> either the PF driver is loaded or some user-space entity steps in to >>>>> start managing the PF. >>>>> >>>>> In reality this can already happen as last I recall igb and ixgbe were >>>>> already capable of having the PF driver removed when SR-IOV was >>>>> enabled and VFs were assigned. Basically the VFs just report link down >>>>> and don't do anything. Reloading the PF driver would have it take over >>>>> in the case of igb and ixgbe since they were designed to handle that >>>>> type of scenario. >>> >>> I think only ixgbe behaves this way between the two, igb disables sriov >>> in its remove function unless we're hung up by that silly >>> pci_vfs_assigned() check, which doesn't apply to vfio assignment. >>> Regardless, yes, some pf drivers do leave sriov enabled on remove, >>> whether that's useful or reasonable is another question. >> >> Right. We can argue that another day. I was just sighting the igb behavior. >> >>>>>> Or what happens when vfio-pci owns the pf, sriov is enabled via the >>>>>> unmanaged interface, and the pf user driver segfaults and gets killed, >>>>>> causing vfio-pci to restore the pf state, including wiping the sriov >>>>>> config? >>>>> >>>>> Wiping the config shouldn't have any effect on the allocated VF pci >>>>> devices. It will cause the VFs to act as though they have fallen off >>>>> of the bus though and the guests would see a surprise remove type >>>>> behavior if I am not mistaken. The MMIO and config accesses would fail >>>>> until the SR-IOV configuration is restored. Really this shouldn't be a >>>>> problem as long as the SR-IOV is enabled prior to loading the vfio-pci >>>>> driver as I would assume it would restore the state an re-enable >>>>> SR-IOV. >>> >>> Nope, sriov does not appear to be part of the device state saved and >>> restored around reset, so I think we just end up in an inconsistent >>> state. Also, vfio-pci stores a copy of the saved state of a device >>> prior to giving the user access to the device and tries to restore that >>> copy when released by the user, so anything that's part of the save >>> state and modified via side-channels while the device is opened, would >>> still be lost. >> >> Well it is but it isn't. The pci_restore_state will call >> pci_restore_iov_state which will just repopulate the data straight out >> of the pdev->sriov structure. I think the assumption is we were >> already carrying enough information that there wasn't any point in >> saving the state since we could just restore it from that structure >> anyway. > > Ah, right, I remember that now. > >> I've been chasing it down and can verify that much this morning after >> testing. A regular hit of the sysfs reset control will not erase the >> state. > > Perhaps long term, but momentarily the vf disappears and that could > imply data loss, no? > >> One issue that I think I found that may be a bug though is if I >> load the vfio-pci driver on an interface and unbind it I am seeing the >> port state reset. I have it chased down to the idle D3 code. It looks >> like going from vfio_pci_probe to vfio_pci_remove is triggering the >> equivilent of the pci_pm_reset since it is cycling me through >> D0->D3->D0 without restoring state after the fact. I also verified >> that setting disable_idle_d3 resolves the issue. Would you have any >> complaints about me doing a save_state in the probe call, and a >> restore_state in the remove? It seems like that should probably be the >> easiest fix. > > Hmm, perhaps I had assumed that pci_set_power_state() would handle such > things, but indeed I do see other drivers calling pci_save_state() > prior and pci_restore_state() after. I think we'd need to audit all of > vfio-pci's calls to pci_set_power_state(), not just probe and remove. > >>>>> In the grand scheme of things how would the situation you describe be >>>>> any different than someone using the "reset" sysfs control on the PF >>>>> while using SR-IOV with driver supported SR-IOV? >>> >>> Similar, but the pf driver has enabled sriov and can register the >>> reset_done error handler to be notified to re-enable sriov or perform >>> other cleanup. If the pf driver is not participating in sriov, it >>> would seem exceptional to support this. Triggering reset via sysfs >>> also falls more closely to shooting yourself in the foot than I think >>> we want to design into driver/subsystem APIs. >> >> To some extent I agree with the shooting yourself in the foot. At the >> same time there isn't actually all that much to re-enable SR-IOV. >> Restoring the PCI SR-IOV configuration space and re-enabling the bus >> master enable. > > Bus master is a pretty interesting example of the trivial control a pf > driver can exert on the vfs though. > >> The Amazon guys would probably know better than I since I haven't >> really worked much with one of these parts yet. Actually the virtio >> that Mark pushed may behave the same way too. As far as I know in >> these firmware cases the hardware itself has everything >> pre-partitioned and set to re-enable as soon as the SR-IOV bits are >> set. I think all they need is a few bits toggled and they are pretty >> much good to go. > > Are they just looking for an sriov capable stub driver? With > increasing vf counts, being able to use something like vfio-pci on the > pf seems like all risk with statistically insignificant increase in > density. On the other hand, if there's a userspace pf management > driver, why not just make it trusted by adding it as a native host > kernel driver? If we're talking about tainting the host kernel to > enable this interaction, maybe it should just be tainted by an out of > tree, possibly non-gpl host pf driver anyway. There can't really be a > case where the pf would be used by an average user without some degree > of privilege or cooperation, right? > >>>>> I suppose if you really wanted we could add a new call that you could >>>>> put into the sriov_configure pointer that would just make it always >>>>> return error. Then that way the configuration could be locked until >>>>> the driver is unloaded. >>>>> >>>>>> I guess I don't understand how vfs can operate fully independent of the >>>>>> pf and why vfio-pci wouldn't just implement a dummy sriov_configure to >>>>>> avoid contending with such issues. >>>>> >>>>> This approach isn't without risk, but it isn't as if it is really a >>>>> new risk we are introducing, and we could do further work to help >>>>> smooth out issues like this. Much of this depends on the design of the >>>>> device and the drivers involved. What is happening in this case is >>>>> that the VFs are managed outside of the kernel itself either by some >>>>> user-space entity or by a firmware entity. >>> >>> But we can't simply split vfs and pfs as separate resources, the vf is >>> dependent on the pf and that implies a privilege hierarchy, if not at >>> least cooperation. The pf driver can intentionally or unintentionally >>> disconnect the vfs at any point in time, possibly leading to data loss >>> or at least denial of service for the vf consumer. I also don't trust >>> that there aren't numerous sriov implementations where the pf isn't >>> privileged to the point of being able to snoop data from the vf. So >>> what sort of usage models are we enabling for the vf? A firmware owned >>> vf subject to these conditions seems mostly pointless. >> >> In the case of these sort of parts the PF isn't really all that >> privileged. Usually the PF is just a VF with the SR-IOV capability >> hanging off of it. I suspect the only thing that might outright >> control anything would be the bus master enable bit. Everything else >> could be independent. The PF driver in such cases doesn't do much. It >> is basically just the host for the configuration space. > > This is sounding more like an sriov capable stub driver. Certainly not > all pfs behave the way you describe above, many are significantly more > privileged and even if not, bus master is a pretty trivial control > point. So we probably need a driver that claims devices known to > behave this way, or a meta driver that bind via dynamic IDs or > driver_override. The proposal here sort of covertly turns anything > that's not an sriov driver into that stub driver with no guarantee that > the subverted driver is a willing or safe participant. > >>>> So I was thinking about this some more. In the case of vfio-pci things >>>> are a bit different since you are essentially taking a given device >>>> and handing it to a VM or userspace and it doesn't guarantee a >>>> communication between the two. >>> >>> Doesn't guarantee communication or cooperation or even trust. >> >> Right, but at the same time I consider this to be a shoot yourself in >> the foot type scenario. If you are going to hand off your PF while VFs >> are active then you are asking for whatever repercussions you end up >> getting. I've added a warning and a TAINT_USER flag to my code at this >> point if you load an active PF in vfio, and I have added a function >> that locks the setting so it cannot be changed once we place a PF in >> the control of vfio-pci. >> >> The way I see it there are two scenarios. One where the PF is just a >> VF with an extra bit of SR-IOV configuration space floating off of it. >> The other is where we want to have SR-IOV enabled and have some third >> party managing the PF. The first one doesn't really care about the >> communication and would prefer that whatever owns the driver on the PF >> just ignore the SR-IOV portion of the configuration space. The second >> one actually cares and would want some sort of >> communication/cooperation but for now I don't see that as being the >> easiest thing to do so it might be best to just let it see a fixed >> number of VFs it just has to work with that. > > There's no such thing as a read-only sriov capability in the spec, > which is a problem we ran into a while ago, vfio-pci exposes the > capability, but protects it as read-only so the user cannot create > devices on the host. QEMU passed this through to the guest, but that > failed as soon as OVMF started supporting sriov as it's unable to size > the VF BAR resources. Now QEMU drops the sriov capability from the > guest capability chain. So it's not clear how this fixed number of vfs > plan works. Are we inventing our own capability for this? If the pf > driver is just a userspace driver and not a VM, maybe what we do now is > sufficient, otherwise there's no standard for exposing fixed vfs. > >>>> My thought is to look at making SR-IOV configuration static or treat >>>> it as read-only when the vfio-pci driver is loaded on a given >>>> interface. In addition I would set the TAINT_USER flag and add a >>>> warning about loading vfio-pci on an active PF, and provide the number >>>> of VFs that were allocated. >>>> >>>> The idea with all of this being that we would at least have a partial >>>> lock on all of this so that you can allocate some number of VFs, and >>>> then start partitioning things up where you could assign the PF and >>>> VFs as needed to the various interfaces. Once the PF is assigned to >>>> the vfio-pci driver it would be locked in terms of the number of VFs >>>> that are provided so there would be no need for an extra communication >>>> channel between the PF driver and the host to deal with changes. >>>> >>>> Thoughts? >>> >>> Is an sriov configuration ever static? vfio is the most prolific user >>> of the pci reset interfaces, AFAIK. Even if we add support for >>> restoring the sriov configuration and even if the pf user isn't trying >>> to be malicious, vf users would need to be prepared for their device >>> arbitrarily dropping off the bus, mmio randomly (from their >>> perspective) being disabled, and maybe other side-effects of the user >>> just trying to use the pf device. Is there even any guarantee that a >>> pf driver can operate identically, regardless of the state of sriov on >>> the pf? It seems like the pf driver needs to be aware of which queues >>> are available for itself vs the vfs in some designs. Is there still >>> long term value in a solution for this if the kernel is tainted? >> >> As far as the device randomly resetting I don't really see how that is >> any different from what we already have in terms of solutions >> including stuff supported by in-driver SR-IOV. The Intel drivers are >> notorious for resetting for any little thing like an MTU change. Us >> resetting due to someone calling open/release on an interface wouldn't >> be anything new. In addition AER can always hit us too. Yes we don't >> have the same recovery mechanisms in place, but in the case of >> something such as this we probably don't need that complex of a >> recovery setup. > > [I think below paragraph is specifically answering last question above, > still long term value in kernel tainting solution] > >> I would think there probably is. If not we wouldn't have gotten the >> patches earlier that were still doing the tainting and warning and >> making use of the vfio-pci driver to enable SR-IOV. I suspect the use >> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and >> then assign VFs into and out of guests. I don't see the PF doing a lot >> of moving after it is setup. The taint flag only really applies if >> someone is looking for support and quite honestly I figure for now the >> USER flag is appropriate for this kind of thing since we are deferring >> all the networking control to the PF which is managed by userspace. > > If we're going to throw up our hands and taint the kernel, then maybe > we could entertain the idea of doing that in a trivial vfio-pci > sriov_configure function. But does that really meet all the use cases > and what's the advantage of that vs Amazon (I guess they're the driver > here) tainting the kernel with their own driver? Mellanox has tried to > enable sriov in vfio-pci in the past, Cc Ilya. > >>> Minimally, it seems like the pf driver (the user driver, not vfio-pci) >>> needs to be a cooperating party in this exchange. How do we determine >>> that and what QoS guarantees are they providing by agreeing to it? >> >> I think the general idea here is that the user has already tested this >> out and determined for themselves that the user driver does what they >> want/need. If it didn't they wouldn't be going to all these lengths to >> set this up. >> >>> Otherwise it doesn't seem like sriov provides the sort of static, hard >>> partitioning of the device that you're asking for. A vf is a piece of >>> the pf and we've released ownership and control of the pf to a user. >>> Thanks, >>> >>> Alex >> >> I am pretty sure that you are describing is true of some, but not for >> all. I think the Amazon solutions and the virtio solution are doing >> hard partitioning of the part. I will leave it to those guys to speak >> for themselves since I don't know anything about the hardware design >> of those parts. > > I think we'd need device specific knowledge and enablement to be able > to take advantage of any hardware partitioning, otherwise we need to > assume the pf is privileged, as implemented in other sriov devices. > > I'm also trying to imagine whether there's a solution via the new > vfio/mdev interface, where the mdev vendor driver would bind to the pf > and effectively present itself as the mdev device. The vendor driver > could provide sriov capabilities and bear the burden of ensuring that > the pf is used cooperatively. The only existing mdev vendor drivers are > vGPUs and rely on on-device DMA translation and isolation, such as > through GTTs, but there have been some thoughts on providing IOMMU based > isolation of mdev/sriov mixed devices (assuming DMA is even required > for userspace management of the pf in this use case). [Cc Kirti] > Thanks, > > Alex > Apologies for getting to this party late. My 2 cents: I think a stub driver is needed for security reasons. Multiple reasons/cases stated in the thread to (continue to) follow that model. ... and wondering why pci-stub isn't enhanced to do just that, since it's already used in device-assignment tasks (VM or DPDK). -- pf devices that have drivers that coordinate don't need pci-stub; pf devices w/no drivers get assigned to pci-stub, then VF's are enabled/disabled/tracked/managed.... mdev wrapping of VFs doesn't (re)solve device-specific needs/extensions for VF migration either. PF-device-specific needs should be handled by a PF driver; if no PF driver, then assuming no device-specific need exists, and using non-device-specific, common pci-stub VF handling -- even w/pci-quirk hooks/workarounds if they need to be added. Finally, if an ah-ha moment is reached that concludes with the need for a PF driver to handle device-specific issues, the pci-stub VF functions may provide the guide/framework for such hastily put together PF drivers, minimizing PF driver variances for sriov handling. Although a temporary disconnect is handled/emulated well in the (e)net space, I also don't think it's an issue for storage, as I would expect two situations to occur wrt data-loss scenarios: a) a PF device has a kernel driver, and it has to handle the up/down of a VF b) the PF is quiesced -- no device-specific kernel driver -- and there is no VF loss due to PF resetting b/c there is no need to reset PF -- pci-stub providing stable, kernel config. Per VF data loss is always an issue for storage interconnects (unless they are net-based, like iSCSI, RDMA, etc.) which is handled at the VF driver or layer above it. - Don
On Mon, Mar 5, 2018 at 12:57 PM, Don Dutile <ddutile@redhat.com> wrote: > On 03/01/2018 03:22 PM, Alex Williamson wrote: >> >> On Wed, 28 Feb 2018 16:36:38 -0800 >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >>> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson >>> <alex.williamson@redhat.com> wrote: >>>> >>>> On Wed, 28 Feb 2018 09:49:21 -0800 >>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>> >>>>> >>>>> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >>>>> <alexander.duyck@gmail.com> wrote: >>>>>> >>>>>> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >>>>>> <alex.williamson@redhat.com> wrote: >>>>>>> >>>>>>> On Tue, 27 Feb 2018 11:06:54 -0800 >>>>>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> From: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>>> >>>>>>>> This patch is meant to add support for SR-IOV on devices when the >>>>>>>> VFs are >>>>>>>> not managed by the kernel. Examples of recent patches attempting to >>>>>>>> do this >>>>>>>> include: >>>>>>> >>>>>>> >>>>>>> It appears to enable sriov when the _pf_ is not managed by the >>>>>>> kernel, but by "managed" we mean that either there is no pf driver or >>>>>>> the pf driver doesn't provide an sriov_configure callback, >>>>>>> intentionally or otherwise. >>>>>>> >>>>>>>> >>>>>>>> virto - https://patchwork.kernel.org/patch/10241225/ >>>>>>>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >>>>>>>> vfio - https://patchwork.kernel.org/patch/10103353/ >>>>>>>> uio - https://patchwork.kernel.org/patch/9974031/ >>>>>>> >>>>>>> >>>>>>> So is the goal to get around the issues with enabling sriov on each >>>>>>> of >>>>>>> the above drivers by doing it under the covers or are you really just >>>>>>> trying to enable sriov for a truly unmanage (no pf driver) case? For >>>>>>> example, should a driver explicitly not wanting sriov enabled >>>>>>> implement >>>>>>> a dummy sriov_configure function? >>>>>>> >>>>>>>> >>>>>>>> Since this is quickly blowing up into a multi-driver problem it is >>>>>>>> probably >>>>>>>> best to implement this solution in one spot. >>>>>>>> >>>>>>>> This patch is an attempt to do that. What we do with this patch is >>>>>>>> provide >>>>>>>> a generic call to enable SR-IOV in the case that the PF driver is >>>>>>>> either >>>>>>>> not present, or the PF driver doesn't support configuring SR-IOV. >>>>>>>> >>>>>>>> A new sysfs value called sriov_unmanaged_autoprobe has been added. >>>>>>>> This >>>>>>>> value is used as the drivers_autoprobe setting of the VFs when they >>>>>>>> are >>>>>>>> being managed by an external entity such as userspace or device >>>>>>>> firmware >>>>>>>> instead of being managed by the kernel. >>>>>>> >>>>>>> >>>>>>> Documentation/ABI/testing/sysfs-bus-pci update is missing. >>>>>> >>>>>> >>>>>> I can make sure to update that in the next version. >>>>>> >>>>>>>> >>>>>>>> One side effect of this change is that the sriov_drivers_autoprobe >>>>>>>> and >>>>>>>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV >>>>>>>> is >>>>>>>> disabled. Attempts to update them when SR-IOV is in use will only >>>>>>>> update >>>>>>>> the local value and will not update sriov->autoprobe. >>>>>>> >>>>>>> >>>>>>> And we expect users to understand when sriov_drivers_autoprobe >>>>>>> applies >>>>>>> vs sriov_unmanaged_autoprobe, even though they're using the same >>>>>>> interfaces to enable sriov? Are all combinations expected to work, >>>>>>> ex. >>>>>>> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >>>>>>> only does it seems like there's opportunity to use this incorrectly, >>>>>>> I >>>>>>> think maybe it might be difficult to use correctly. >>>>>>> >>>>>>>> >>>>>>>> I based my patch set originally on the patch by Mark Rustad but >>>>>>>> there isn't >>>>>>>> much left after going through and cleaning out the bits that were no >>>>>>>> longer >>>>>>>> needed, and after incorporating the feedback from David Miller. >>>>>>>> >>>>>>>> I have included the authors of the original 4 patches above in the >>>>>>>> Cc here. >>>>>>>> My hope is to get feedback and/or review on if this works for their >>>>>>>> use >>>>>>>> cases. >>>>>>>> >>>>>>>> Cc: Mark Rustad <mark.d.rustad@intel.com> >>>>>>>> Cc: Maximilian Heyne <mheyne@amazon.de> >>>>>>>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >>>>>>>> Cc: David Woodhouse <dwmw@amazon.co.uk> >>>>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>>> --- >>>>>>>> drivers/pci/iov.c | 27 +++++++++++++++++++- >>>>>>>> drivers/pci/pci-driver.c | 2 + >>>>>>>> drivers/pci/pci-sysfs.c | 62 >>>>>>>> +++++++++++++++++++++++++++++++++++++++++----- >>>>>>>> drivers/pci/pci.h | 4 ++- >>>>>>>> include/linux/pci.h | 1 + >>>>>>>> 5 files changed, 86 insertions(+), 10 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>>>>>>> index 677924ae0350..7b8858bd8d03 100644 >>>>>>>> --- a/drivers/pci/iov.c >>>>>>>> +++ b/drivers/pci/iov.c >>>>>>>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int >>>>>>>> pos) >>>>>>>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, >>>>>>>> &iov->vf_device); >>>>>>>> iov->pgsz = pgsz; >>>>>>>> iov->self = dev; >>>>>>>> + iov->autoprobe = true; >>>>>>>> iov->drivers_autoprobe = true; >>>>>>>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >>>>>>>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, >>>>>>>> &iov->link); >>>>>>>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >>>>>>>> */ >>>>>>>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool >>>>>>>> auto_probe) >>>>>>>> { >>>>>>>> - if (dev->is_physfn) >>>>>>>> + if (dev->is_physfn) { >>>>>>>> dev->sriov->drivers_autoprobe = auto_probe; >>>>>>>> + if (!dev->sriov->num_VFs) >>>>>>>> + dev->sriov->autoprobe = auto_probe; >>>>>>> >>>>>>> >>>>>>> Why is dev->sriov->autoprobe set any time other than immediately >>>>>>> prior >>>>>>> to enabling VFs? >>>>>> >>>>>> >>>>>> My concern here was drivers that are still floating around with the >>>>>> old module parameter option for enabling SR-IOV. In the unlikely event >>>>>> that somebody was to use such a driver I wanted to make certain that >>>>>> the drivers_autoprobe state was pre-populated. >>>> >>>> >>>> Good point, but maybe that just means we should be setting it in >>>> sriov_enable()? >>> >>> >>> I suppose we could. I would just have to check and see if we have any >>> drivers lurking out there that are supporting SR-IOV without >>> supporting the sysfs approach. As long as that is the case we could >>> probably put it there. >>> >>>>>>>> + } >>>>>>>> } >>>>>>>> >>>>>>>> /** >>>>>>>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >>>>>>>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >>>>>>>> >>>>>>>> /** >>>>>>>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged >>>>>>>> SR-IOV >>>>>>>> + * @dev: the PCI device >>>>>>>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >>>>>>>> + * >>>>>>>> + * Used to provide generic enable/disable SR-IOV option for devices >>>>>>>> + * that do not manage the VFs generated by their driver, or have no >>>>>>>> + * driver present. >>>>>>>> + */ >>>>>>>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int >>>>>>>> nr_virtfn) >>>>>>>> +{ >>>>>>>> + int rc = 0; >>>>>>>> + >>>>>>>> + if (!nr_virtfn) >>>>>>>> + pci_disable_sriov(dev); >>>>>>>> + else >>>>>>>> + rc = pci_enable_sriov(dev, nr_virtfn); >>>>>>>> + >>>>>>>> + return rc ? rc : nr_virtfn; >>>>>>>> +} >>>>>>>> + >>>>>>>> +/** >>>>>>>> * pci_num_vf - return number of VFs associated with a PF >>>>>>>> device_release_driver >>>>>>>> * @dev: the PCI device >>>>>>>> * >>>>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >>>>>>>> index 3bed6beda051..2cc68dff6130 100644 >>>>>>>> --- a/drivers/pci/pci-driver.c >>>>>>>> +++ b/drivers/pci/pci-driver.c >>>>>>>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev >>>>>>>> *dev) >>>>>>>> #ifdef CONFIG_PCI_IOV >>>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>>> { >>>>>>>> - return (!pdev->is_virtfn || >>>>>>>> pdev->physfn->sriov->drivers_autoprobe); >>>>>>>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >>>>>>>> } >>>>>>>> #else >>>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >>>>>>>> index eb6bee8724cc..e701b6dbc267 100644 >>>>>>>> --- a/drivers/pci/pci-sysfs.c >>>>>>>> +++ b/drivers/pci/pci-sysfs.c >>>>>>>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device >>>>>>>> *dev, >>>>>>>> struct device_attribute *attr, >>>>>>>> const char *buf, size_t count) >>>>>>>> { >>>>>>>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >>>>>>>> struct pci_dev *pdev = to_pci_dev(dev); >>>>>>>> int ret; >>>>>>>> u16 num_vfs; >>>>>>>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct >>>>>>>> device *dev, >>>>>>>> goto exit; >>>>>>>> >>>>>>>> /* is PF driver loaded w/callback */ >>>>>>>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >>>>>>>> - pci_info(pdev, "Driver doesn't support SRIOV >>>>>>>> configuration via sysfs\n"); >>>>>>>> - ret = -ENOENT; >>>>>>>> - goto exit; >>>>>>>> - } >>>>>>>> + if (pdev->driver && pdev->driver->sriov_configure) >>>>>>>> + sriov_configure = pdev->driver->sriov_configure; >>>>>>>> + else >>>>>>>> + sriov_configure = pci_sriov_configure_unmanaged; >>>>>>> >>>>>>> >>>>>>> So an unwitting user is now able to enable vfs, independent of the >>>>>>> pf... the trouble being that they're probably going to expect them to >>>>>>> work and the more common case is that they won't. For instance, what >>>>>>> can you do with an igbvf when igb isn't managing the pf? >>>>>> >>>>>> >>>>>> Well the VFs wouldn't be able to do anything. Basically they would be >>>>>> sitting there with no driver loaded on them unless they are assigned >>>>>> to a guest, or the root user had enabled the unmanaged option. If you >>>>>> did load a driver on it the VF would sit there with link down unless >>>>>> either the PF driver is loaded or some user-space entity steps in to >>>>>> start managing the PF. >>>>>> >>>>>> In reality this can already happen as last I recall igb and ixgbe were >>>>>> already capable of having the PF driver removed when SR-IOV was >>>>>> enabled and VFs were assigned. Basically the VFs just report link down >>>>>> and don't do anything. Reloading the PF driver would have it take over >>>>>> in the case of igb and ixgbe since they were designed to handle that >>>>>> type of scenario. >>>> >>>> >>>> I think only ixgbe behaves this way between the two, igb disables sriov >>>> in its remove function unless we're hung up by that silly >>>> pci_vfs_assigned() check, which doesn't apply to vfio assignment. >>>> Regardless, yes, some pf drivers do leave sriov enabled on remove, >>>> whether that's useful or reasonable is another question. >>> >>> >>> Right. We can argue that another day. I was just sighting the igb >>> behavior. >>> >>>>>>> Or what happens when vfio-pci owns the pf, sriov is enabled via the >>>>>>> unmanaged interface, and the pf user driver segfaults and gets >>>>>>> killed, >>>>>>> causing vfio-pci to restore the pf state, including wiping the sriov >>>>>>> config? >>>>>> >>>>>> >>>>>> Wiping the config shouldn't have any effect on the allocated VF pci >>>>>> devices. It will cause the VFs to act as though they have fallen off >>>>>> of the bus though and the guests would see a surprise remove type >>>>>> behavior if I am not mistaken. The MMIO and config accesses would fail >>>>>> until the SR-IOV configuration is restored. Really this shouldn't be a >>>>>> problem as long as the SR-IOV is enabled prior to loading the vfio-pci >>>>>> driver as I would assume it would restore the state an re-enable >>>>>> SR-IOV. >>>> >>>> >>>> Nope, sriov does not appear to be part of the device state saved and >>>> restored around reset, so I think we just end up in an inconsistent >>>> state. Also, vfio-pci stores a copy of the saved state of a device >>>> prior to giving the user access to the device and tries to restore that >>>> copy when released by the user, so anything that's part of the save >>>> state and modified via side-channels while the device is opened, would >>>> still be lost. >>> >>> >>> Well it is but it isn't. The pci_restore_state will call >>> pci_restore_iov_state which will just repopulate the data straight out >>> of the pdev->sriov structure. I think the assumption is we were >>> already carrying enough information that there wasn't any point in >>> saving the state since we could just restore it from that structure >>> anyway. >> >> >> Ah, right, I remember that now. >> >>> >>> I've been chasing it down and can verify that much this morning after >>> testing. A regular hit of the sysfs reset control will not erase the >>> state. >> >> >> Perhaps long term, but momentarily the vf disappears and that could >> imply data loss, no? >> >>> One issue that I think I found that may be a bug though is if I >>> load the vfio-pci driver on an interface and unbind it I am seeing the >>> port state reset. I have it chased down to the idle D3 code. It looks >>> like going from vfio_pci_probe to vfio_pci_remove is triggering the >>> equivilent of the pci_pm_reset since it is cycling me through >>> D0->D3->D0 without restoring state after the fact. I also verified >>> that setting disable_idle_d3 resolves the issue. Would you have any >>> complaints about me doing a save_state in the probe call, and a >>> restore_state in the remove? It seems like that should probably be the >>> easiest fix. >> >> >> Hmm, perhaps I had assumed that pci_set_power_state() would handle such >> things, but indeed I do see other drivers calling pci_save_state() >> prior and pci_restore_state() after. I think we'd need to audit all of >> vfio-pci's calls to pci_set_power_state(), not just probe and remove. >> >>>>>> In the grand scheme of things how would the situation you describe be >>>>>> any different than someone using the "reset" sysfs control on the PF >>>>>> while using SR-IOV with driver supported SR-IOV? >>>> >>>> >>>> Similar, but the pf driver has enabled sriov and can register the >>>> reset_done error handler to be notified to re-enable sriov or perform >>>> other cleanup. If the pf driver is not participating in sriov, it >>>> would seem exceptional to support this. Triggering reset via sysfs >>>> also falls more closely to shooting yourself in the foot than I think >>>> we want to design into driver/subsystem APIs. >>> >>> >>> To some extent I agree with the shooting yourself in the foot. At the >>> same time there isn't actually all that much to re-enable SR-IOV. >>> Restoring the PCI SR-IOV configuration space and re-enabling the bus >>> master enable. >> >> >> Bus master is a pretty interesting example of the trivial control a pf >> driver can exert on the vfs though. >> >>> The Amazon guys would probably know better than I since I haven't >>> really worked much with one of these parts yet. Actually the virtio >>> that Mark pushed may behave the same way too. As far as I know in >>> these firmware cases the hardware itself has everything >>> pre-partitioned and set to re-enable as soon as the SR-IOV bits are >>> set. I think all they need is a few bits toggled and they are pretty >>> much good to go. >> >> >> Are they just looking for an sriov capable stub driver? With >> increasing vf counts, being able to use something like vfio-pci on the >> pf seems like all risk with statistically insignificant increase in >> density. On the other hand, if there's a userspace pf management >> driver, why not just make it trusted by adding it as a native host >> kernel driver? If we're talking about tainting the host kernel to >> enable this interaction, maybe it should just be tainted by an out of >> tree, possibly non-gpl host pf driver anyway. There can't really be a >> case where the pf would be used by an average user without some degree >> of privilege or cooperation, right? >> >>>>>> >>>>>> I suppose if you really wanted we could add a new call that you could >>>>>> put into the sriov_configure pointer that would just make it always >>>>>> return error. Then that way the configuration could be locked until >>>>>> the driver is unloaded. >>>>>> >>>>>>> >>>>>>> I guess I don't understand how vfs can operate fully independent of >>>>>>> the >>>>>>> pf and why vfio-pci wouldn't just implement a dummy sriov_configure >>>>>>> to >>>>>>> avoid contending with such issues. >>>>>> >>>>>> >>>>>> This approach isn't without risk, but it isn't as if it is really a >>>>>> new risk we are introducing, and we could do further work to help >>>>>> smooth out issues like this. Much of this depends on the design of the >>>>>> device and the drivers involved. What is happening in this case is >>>>>> that the VFs are managed outside of the kernel itself either by some >>>>>> user-space entity or by a firmware entity. >>>> >>>> >>>> But we can't simply split vfs and pfs as separate resources, the vf is >>>> dependent on the pf and that implies a privilege hierarchy, if not at >>>> least cooperation. The pf driver can intentionally or unintentionally >>>> disconnect the vfs at any point in time, possibly leading to data loss >>>> or at least denial of service for the vf consumer. I also don't trust >>>> that there aren't numerous sriov implementations where the pf isn't >>>> privileged to the point of being able to snoop data from the vf. So >>>> what sort of usage models are we enabling for the vf? A firmware owned >>>> vf subject to these conditions seems mostly pointless. >>> >>> >>> In the case of these sort of parts the PF isn't really all that >>> privileged. Usually the PF is just a VF with the SR-IOV capability >>> hanging off of it. I suspect the only thing that might outright >>> control anything would be the bus master enable bit. Everything else >>> could be independent. The PF driver in such cases doesn't do much. It >>> is basically just the host for the configuration space. >> >> >> This is sounding more like an sriov capable stub driver. Certainly not >> all pfs behave the way you describe above, many are significantly more >> privileged and even if not, bus master is a pretty trivial control >> point. So we probably need a driver that claims devices known to >> behave this way, or a meta driver that bind via dynamic IDs or >> driver_override. The proposal here sort of covertly turns anything >> that's not an sriov driver into that stub driver with no guarantee that >> the subverted driver is a willing or safe participant. >> >>>>> So I was thinking about this some more. In the case of vfio-pci things >>>>> are a bit different since you are essentially taking a given device >>>>> and handing it to a VM or userspace and it doesn't guarantee a >>>>> communication between the two. >>>> >>>> >>>> Doesn't guarantee communication or cooperation or even trust. >>> >>> >>> Right, but at the same time I consider this to be a shoot yourself in >>> the foot type scenario. If you are going to hand off your PF while VFs >>> are active then you are asking for whatever repercussions you end up >>> getting. I've added a warning and a TAINT_USER flag to my code at this >>> point if you load an active PF in vfio, and I have added a function >>> that locks the setting so it cannot be changed once we place a PF in >>> the control of vfio-pci. >>> >>> The way I see it there are two scenarios. One where the PF is just a >>> VF with an extra bit of SR-IOV configuration space floating off of it. >>> The other is where we want to have SR-IOV enabled and have some third >>> party managing the PF. The first one doesn't really care about the >>> communication and would prefer that whatever owns the driver on the PF >>> just ignore the SR-IOV portion of the configuration space. The second >>> one actually cares and would want some sort of >>> communication/cooperation but for now I don't see that as being the >>> easiest thing to do so it might be best to just let it see a fixed >>> number of VFs it just has to work with that. >> >> >> There's no such thing as a read-only sriov capability in the spec, >> which is a problem we ran into a while ago, vfio-pci exposes the >> capability, but protects it as read-only so the user cannot create >> devices on the host. QEMU passed this through to the guest, but that >> failed as soon as OVMF started supporting sriov as it's unable to size >> the VF BAR resources. Now QEMU drops the sriov capability from the >> guest capability chain. So it's not clear how this fixed number of vfs >> plan works. Are we inventing our own capability for this? If the pf >> driver is just a userspace driver and not a VM, maybe what we do now is >> sufficient, otherwise there's no standard for exposing fixed vfs. >> >>>>> My thought is to look at making SR-IOV configuration static or treat >>>>> it as read-only when the vfio-pci driver is loaded on a given >>>>> interface. In addition I would set the TAINT_USER flag and add a >>>>> warning about loading vfio-pci on an active PF, and provide the number >>>>> of VFs that were allocated. >>>>> >>>>> The idea with all of this being that we would at least have a partial >>>>> lock on all of this so that you can allocate some number of VFs, and >>>>> then start partitioning things up where you could assign the PF and >>>>> VFs as needed to the various interfaces. Once the PF is assigned to >>>>> the vfio-pci driver it would be locked in terms of the number of VFs >>>>> that are provided so there would be no need for an extra communication >>>>> channel between the PF driver and the host to deal with changes. >>>>> >>>>> Thoughts? >>>> >>>> >>>> Is an sriov configuration ever static? vfio is the most prolific user >>>> of the pci reset interfaces, AFAIK. Even if we add support for >>>> restoring the sriov configuration and even if the pf user isn't trying >>>> to be malicious, vf users would need to be prepared for their device >>>> arbitrarily dropping off the bus, mmio randomly (from their >>>> perspective) being disabled, and maybe other side-effects of the user >>>> just trying to use the pf device. Is there even any guarantee that a >>>> pf driver can operate identically, regardless of the state of sriov on >>>> the pf? It seems like the pf driver needs to be aware of which queues >>>> are available for itself vs the vfs in some designs. Is there still >>>> long term value in a solution for this if the kernel is tainted? >>> >>> >>> As far as the device randomly resetting I don't really see how that is >>> any different from what we already have in terms of solutions >>> including stuff supported by in-driver SR-IOV. The Intel drivers are >>> notorious for resetting for any little thing like an MTU change. Us >>> resetting due to someone calling open/release on an interface wouldn't >>> be anything new. In addition AER can always hit us too. Yes we don't >>> have the same recovery mechanisms in place, but in the case of >>> something such as this we probably don't need that complex of a >>> recovery setup. >> >> >> [I think below paragraph is specifically answering last question above, >> still long term value in kernel tainting solution] >> >>> I would think there probably is. If not we wouldn't have gotten the >>> patches earlier that were still doing the tainting and warning and >>> making use of the vfio-pci driver to enable SR-IOV. I suspect the use >>> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and >>> then assign VFs into and out of guests. I don't see the PF doing a lot >>> of moving after it is setup. The taint flag only really applies if >>> someone is looking for support and quite honestly I figure for now the >>> USER flag is appropriate for this kind of thing since we are deferring >>> all the networking control to the PF which is managed by userspace. >> >> >> If we're going to throw up our hands and taint the kernel, then maybe >> we could entertain the idea of doing that in a trivial vfio-pci >> sriov_configure function. But does that really meet all the use cases >> and what's the advantage of that vs Amazon (I guess they're the driver >> here) tainting the kernel with their own driver? Mellanox has tried to >> enable sriov in vfio-pci in the past, Cc Ilya. >> >>>> Minimally, it seems like the pf driver (the user driver, not vfio-pci) >>>> needs to be a cooperating party in this exchange. How do we determine >>>> that and what QoS guarantees are they providing by agreeing to it? >>> >>> >>> I think the general idea here is that the user has already tested this >>> out and determined for themselves that the user driver does what they >>> want/need. If it didn't they wouldn't be going to all these lengths to >>> set this up. >>> >>>> Otherwise it doesn't seem like sriov provides the sort of static, hard >>>> partitioning of the device that you're asking for. A vf is a piece of >>>> the pf and we've released ownership and control of the pf to a user. >>>> Thanks, >>>> >>>> Alex >>> >>> >>> I am pretty sure that you are describing is true of some, but not for >>> all. I think the Amazon solutions and the virtio solution are doing >>> hard partitioning of the part. I will leave it to those guys to speak >>> for themselves since I don't know anything about the hardware design >>> of those parts. >> >> >> I think we'd need device specific knowledge and enablement to be able >> to take advantage of any hardware partitioning, otherwise we need to >> assume the pf is privileged, as implemented in other sriov devices. >> >> I'm also trying to imagine whether there's a solution via the new >> vfio/mdev interface, where the mdev vendor driver would bind to the pf >> and effectively present itself as the mdev device. The vendor driver >> could provide sriov capabilities and bear the burden of ensuring that >> the pf is used cooperatively. The only existing mdev vendor drivers are >> vGPUs and rely on on-device DMA translation and isolation, such as >> through GTTs, but there have been some thoughts on providing IOMMU based >> isolation of mdev/sriov mixed devices (assuming DMA is even required >> for userspace management of the pf in this use case). [Cc Kirti] >> Thanks, >> >> Alex >> > Apologies for getting to this party late. > > My 2 cents: > > I think a stub driver is needed for security reasons. > Multiple reasons/cases stated in the thread to (continue to) follow that > model. > ... and wondering why pci-stub isn't enhanced to do just that, since > it's already used in device-assignment tasks (VM or DPDK). There was an attempt to update pci-stub: https://patchwork.kernel.org/patch/10109935/ One issue with using pci-stub is it doesn't address the security issues needed if an IOMMU is enabled. For that reason I think vfio-pci might be a better choice, and that is why my v2 is based around it. For my v2 of this patch set I dropped the pure generic solution and instead just updated virtio and vfio-pci in order to meet the needs as called out in the previous patch sets. > -- pf devices that have drivers that coordinate don't need pci-stub; > pf devices w/no drivers get assigned to pci-stub, then VF's are > enabled/disabled/tracked/managed.... Yeah, my first attempt at this was to avoid doing what I consider to be somewhat ridiculous things like adding SR-IOV support to virtio_net. It didn't seem right to me but in the grand scheme of things that ends up kind of being the way we have to go in order to put some boundaries on when/where SR-IOV can be enabled. > mdev wrapping of VFs doesn't (re)solve device-specific needs/extensions > for VF migration either. > PF-device-specific needs should be handled by a PF driver; > if no PF driver, then assuming no device-specific need exists, and using > non-device-specific, common pci-stub VF handling The issue here is how do you classify that there is no common PF driver, or is this something that would be decided by the user? Also I don't know if this addresses the needs fro the DPDK crowd to enable a userspace PF driver? > -- even w/pci-quirk hooks/workarounds if they need to be added. > Finally, if an ah-ha moment is reached that concludes with the > need for a PF driver to handle device-specific issues, the pci-stub VF > functions may provide the guide/framework for such hastily put together PF > drivers, > minimizing PF driver variances for sriov handling. I'm not sure I followed this part. Are you saying the pci-stub driver gets used as a template essentially for creating a PF driver if needed? > Although a temporary disconnect is handled/emulated well in the (e)net > space, > I also don't think it's an issue for storage, as I would expect two > situations to occur > wrt data-loss scenarios: > a) a PF device has a kernel driver, and it has to handle the up/down of a VF > b) the PF is quiesced -- no device-specific kernel driver -- > and there is no VF loss due to PF resetting b/c there > is no need to reset PF -- pci-stub providing stable, kernel config. > Per VF data loss is always an issue for storage interconnects > (unless they are net-based, like iSCSI, RDMA, etc.) which is handled > at the VF driver or layer above it. > > - Don My concern is we are starting to see a third option here. A PF that has a kernel driver, but knows nothing of SR-IOV as it is completely managed by firmware. An example being the virtio_net that somehow knows how to do SR-IOV (https://patchwork.kernel.org/patch/10241225/). My understanding is that the part in question is providing a lightweight PF that is essentially just another VF, but has some extra PCIe configuration space to it allowing for SR-IOV and AER functionality. In addition we are also seeing customer scenarios where things like DPDK are running as the PF with kernel VFs. Ideally we want to contain that as much as possible and my understanding is that igb_uio currently isn't really providing much in the way of containment. - Alex
On 03/05/2018 04:41 PM, Alexander Duyck wrote: > On Mon, Mar 5, 2018 at 12:57 PM, Don Dutile <ddutile@redhat.com> wrote: >> On 03/01/2018 03:22 PM, Alex Williamson wrote: >>> >>> On Wed, 28 Feb 2018 16:36:38 -0800 >>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>> >>>> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson >>>> <alex.williamson@redhat.com> wrote: >>>>> >>>>> On Wed, 28 Feb 2018 09:49:21 -0800 >>>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>>> >>>>>> >>>>>> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >>>>>> <alexander.duyck@gmail.com> wrote: >>>>>>> >>>>>>> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >>>>>>> <alex.williamson@redhat.com> wrote: >>>>>>>> >>>>>>>> On Tue, 27 Feb 2018 11:06:54 -0800 >>>>>>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> From: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>>>> >>>>>>>>> This patch is meant to add support for SR-IOV on devices when the >>>>>>>>> VFs are >>>>>>>>> not managed by the kernel. Examples of recent patches attempting to >>>>>>>>> do this >>>>>>>>> include: >>>>>>>> >>>>>>>> >>>>>>>> It appears to enable sriov when the _pf_ is not managed by the >>>>>>>> kernel, but by "managed" we mean that either there is no pf driver or >>>>>>>> the pf driver doesn't provide an sriov_configure callback, >>>>>>>> intentionally or otherwise. >>>>>>>> >>>>>>>>> >>>>>>>>> virto - https://patchwork.kernel.org/patch/10241225/ >>>>>>>>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >>>>>>>>> vfio - https://patchwork.kernel.org/patch/10103353/ >>>>>>>>> uio - https://patchwork.kernel.org/patch/9974031/ >>>>>>>> >>>>>>>> >>>>>>>> So is the goal to get around the issues with enabling sriov on each >>>>>>>> of >>>>>>>> the above drivers by doing it under the covers or are you really just >>>>>>>> trying to enable sriov for a truly unmanage (no pf driver) case? For >>>>>>>> example, should a driver explicitly not wanting sriov enabled >>>>>>>> implement >>>>>>>> a dummy sriov_configure function? >>>>>>>> >>>>>>>>> >>>>>>>>> Since this is quickly blowing up into a multi-driver problem it is >>>>>>>>> probably >>>>>>>>> best to implement this solution in one spot. >>>>>>>>> >>>>>>>>> This patch is an attempt to do that. What we do with this patch is >>>>>>>>> provide >>>>>>>>> a generic call to enable SR-IOV in the case that the PF driver is >>>>>>>>> either >>>>>>>>> not present, or the PF driver doesn't support configuring SR-IOV. >>>>>>>>> >>>>>>>>> A new sysfs value called sriov_unmanaged_autoprobe has been added. >>>>>>>>> This >>>>>>>>> value is used as the drivers_autoprobe setting of the VFs when they >>>>>>>>> are >>>>>>>>> being managed by an external entity such as userspace or device >>>>>>>>> firmware >>>>>>>>> instead of being managed by the kernel. >>>>>>>> >>>>>>>> >>>>>>>> Documentation/ABI/testing/sysfs-bus-pci update is missing. >>>>>>> >>>>>>> >>>>>>> I can make sure to update that in the next version. >>>>>>> >>>>>>>>> >>>>>>>>> One side effect of this change is that the sriov_drivers_autoprobe >>>>>>>>> and >>>>>>>>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV >>>>>>>>> is >>>>>>>>> disabled. Attempts to update them when SR-IOV is in use will only >>>>>>>>> update >>>>>>>>> the local value and will not update sriov->autoprobe. >>>>>>>> >>>>>>>> >>>>>>>> And we expect users to understand when sriov_drivers_autoprobe >>>>>>>> applies >>>>>>>> vs sriov_unmanaged_autoprobe, even though they're using the same >>>>>>>> interfaces to enable sriov? Are all combinations expected to work, >>>>>>>> ex. >>>>>>>> unmanaged sriov is enabled, a native pf driver loads, vfs work? Not >>>>>>>> only does it seems like there's opportunity to use this incorrectly, >>>>>>>> I >>>>>>>> think maybe it might be difficult to use correctly. >>>>>>>> >>>>>>>>> >>>>>>>>> I based my patch set originally on the patch by Mark Rustad but >>>>>>>>> there isn't >>>>>>>>> much left after going through and cleaning out the bits that were no >>>>>>>>> longer >>>>>>>>> needed, and after incorporating the feedback from David Miller. >>>>>>>>> >>>>>>>>> I have included the authors of the original 4 patches above in the >>>>>>>>> Cc here. >>>>>>>>> My hope is to get feedback and/or review on if this works for their >>>>>>>>> use >>>>>>>>> cases. >>>>>>>>> >>>>>>>>> Cc: Mark Rustad <mark.d.rustad@intel.com> >>>>>>>>> Cc: Maximilian Heyne <mheyne@amazon.de> >>>>>>>>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >>>>>>>>> Cc: David Woodhouse <dwmw@amazon.co.uk> >>>>>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>>>> --- >>>>>>>>> drivers/pci/iov.c | 27 +++++++++++++++++++- >>>>>>>>> drivers/pci/pci-driver.c | 2 + >>>>>>>>> drivers/pci/pci-sysfs.c | 62 >>>>>>>>> +++++++++++++++++++++++++++++++++++++++++----- >>>>>>>>> drivers/pci/pci.h | 4 ++- >>>>>>>>> include/linux/pci.h | 1 + >>>>>>>>> 5 files changed, 86 insertions(+), 10 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>>>>>>>> index 677924ae0350..7b8858bd8d03 100644 >>>>>>>>> --- a/drivers/pci/iov.c >>>>>>>>> +++ b/drivers/pci/iov.c >>>>>>>>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int >>>>>>>>> pos) >>>>>>>>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, >>>>>>>>> &iov->vf_device); >>>>>>>>> iov->pgsz = pgsz; >>>>>>>>> iov->self = dev; >>>>>>>>> + iov->autoprobe = true; >>>>>>>>> iov->drivers_autoprobe = true; >>>>>>>>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); >>>>>>>>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, >>>>>>>>> &iov->link); >>>>>>>>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) >>>>>>>>> */ >>>>>>>>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool >>>>>>>>> auto_probe) >>>>>>>>> { >>>>>>>>> - if (dev->is_physfn) >>>>>>>>> + if (dev->is_physfn) { >>>>>>>>> dev->sriov->drivers_autoprobe = auto_probe; >>>>>>>>> + if (!dev->sriov->num_VFs) >>>>>>>>> + dev->sriov->autoprobe = auto_probe; >>>>>>>> >>>>>>>> >>>>>>>> Why is dev->sriov->autoprobe set any time other than immediately >>>>>>>> prior >>>>>>>> to enabling VFs? >>>>>>> >>>>>>> >>>>>>> My concern here was drivers that are still floating around with the >>>>>>> old module parameter option for enabling SR-IOV. In the unlikely event >>>>>>> that somebody was to use such a driver I wanted to make certain that >>>>>>> the drivers_autoprobe state was pre-populated. >>>>> >>>>> >>>>> Good point, but maybe that just means we should be setting it in >>>>> sriov_enable()? >>>> >>>> >>>> I suppose we could. I would just have to check and see if we have any >>>> drivers lurking out there that are supporting SR-IOV without >>>> supporting the sysfs approach. As long as that is the case we could >>>> probably put it there. >>>> >>>>>>>>> + } >>>>>>>>> } >>>>>>>>> >>>>>>>>> /** >>>>>>>>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >>>>>>>>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >>>>>>>>> >>>>>>>>> /** >>>>>>>>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged >>>>>>>>> SR-IOV >>>>>>>>> + * @dev: the PCI device >>>>>>>>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable >>>>>>>>> + * >>>>>>>>> + * Used to provide generic enable/disable SR-IOV option for devices >>>>>>>>> + * that do not manage the VFs generated by their driver, or have no >>>>>>>>> + * driver present. >>>>>>>>> + */ >>>>>>>>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int >>>>>>>>> nr_virtfn) >>>>>>>>> +{ >>>>>>>>> + int rc = 0; >>>>>>>>> + >>>>>>>>> + if (!nr_virtfn) >>>>>>>>> + pci_disable_sriov(dev); >>>>>>>>> + else >>>>>>>>> + rc = pci_enable_sriov(dev, nr_virtfn); >>>>>>>>> + >>>>>>>>> + return rc ? rc : nr_virtfn; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> +/** >>>>>>>>> * pci_num_vf - return number of VFs associated with a PF >>>>>>>>> device_release_driver >>>>>>>>> * @dev: the PCI device >>>>>>>>> * >>>>>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >>>>>>>>> index 3bed6beda051..2cc68dff6130 100644 >>>>>>>>> --- a/drivers/pci/pci-driver.c >>>>>>>>> +++ b/drivers/pci/pci-driver.c >>>>>>>>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev >>>>>>>>> *dev) >>>>>>>>> #ifdef CONFIG_PCI_IOV >>>>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>>>> { >>>>>>>>> - return (!pdev->is_virtfn || >>>>>>>>> pdev->physfn->sriov->drivers_autoprobe); >>>>>>>>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >>>>>>>>> } >>>>>>>>> #else >>>>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >>>>>>>>> index eb6bee8724cc..e701b6dbc267 100644 >>>>>>>>> --- a/drivers/pci/pci-sysfs.c >>>>>>>>> +++ b/drivers/pci/pci-sysfs.c >>>>>>>>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device >>>>>>>>> *dev, >>>>>>>>> struct device_attribute *attr, >>>>>>>>> const char *buf, size_t count) >>>>>>>>> { >>>>>>>>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >>>>>>>>> struct pci_dev *pdev = to_pci_dev(dev); >>>>>>>>> int ret; >>>>>>>>> u16 num_vfs; >>>>>>>>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct >>>>>>>>> device *dev, >>>>>>>>> goto exit; >>>>>>>>> >>>>>>>>> /* is PF driver loaded w/callback */ >>>>>>>>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >>>>>>>>> - pci_info(pdev, "Driver doesn't support SRIOV >>>>>>>>> configuration via sysfs\n"); >>>>>>>>> - ret = -ENOENT; >>>>>>>>> - goto exit; >>>>>>>>> - } >>>>>>>>> + if (pdev->driver && pdev->driver->sriov_configure) >>>>>>>>> + sriov_configure = pdev->driver->sriov_configure; >>>>>>>>> + else >>>>>>>>> + sriov_configure = pci_sriov_configure_unmanaged; >>>>>>>> >>>>>>>> >>>>>>>> So an unwitting user is now able to enable vfs, independent of the >>>>>>>> pf... the trouble being that they're probably going to expect them to >>>>>>>> work and the more common case is that they won't. For instance, what >>>>>>>> can you do with an igbvf when igb isn't managing the pf? >>>>>>> >>>>>>> >>>>>>> Well the VFs wouldn't be able to do anything. Basically they would be >>>>>>> sitting there with no driver loaded on them unless they are assigned >>>>>>> to a guest, or the root user had enabled the unmanaged option. If you >>>>>>> did load a driver on it the VF would sit there with link down unless >>>>>>> either the PF driver is loaded or some user-space entity steps in to >>>>>>> start managing the PF. >>>>>>> >>>>>>> In reality this can already happen as last I recall igb and ixgbe were >>>>>>> already capable of having the PF driver removed when SR-IOV was >>>>>>> enabled and VFs were assigned. Basically the VFs just report link down >>>>>>> and don't do anything. Reloading the PF driver would have it take over >>>>>>> in the case of igb and ixgbe since they were designed to handle that >>>>>>> type of scenario. >>>>> >>>>> >>>>> I think only ixgbe behaves this way between the two, igb disables sriov >>>>> in its remove function unless we're hung up by that silly >>>>> pci_vfs_assigned() check, which doesn't apply to vfio assignment. >>>>> Regardless, yes, some pf drivers do leave sriov enabled on remove, >>>>> whether that's useful or reasonable is another question. >>>> >>>> >>>> Right. We can argue that another day. I was just sighting the igb >>>> behavior. >>>> >>>>>>>> Or what happens when vfio-pci owns the pf, sriov is enabled via the >>>>>>>> unmanaged interface, and the pf user driver segfaults and gets >>>>>>>> killed, >>>>>>>> causing vfio-pci to restore the pf state, including wiping the sriov >>>>>>>> config? >>>>>>> >>>>>>> >>>>>>> Wiping the config shouldn't have any effect on the allocated VF pci >>>>>>> devices. It will cause the VFs to act as though they have fallen off >>>>>>> of the bus though and the guests would see a surprise remove type >>>>>>> behavior if I am not mistaken. The MMIO and config accesses would fail >>>>>>> until the SR-IOV configuration is restored. Really this shouldn't be a >>>>>>> problem as long as the SR-IOV is enabled prior to loading the vfio-pci >>>>>>> driver as I would assume it would restore the state an re-enable >>>>>>> SR-IOV. >>>>> >>>>> >>>>> Nope, sriov does not appear to be part of the device state saved and >>>>> restored around reset, so I think we just end up in an inconsistent >>>>> state. Also, vfio-pci stores a copy of the saved state of a device >>>>> prior to giving the user access to the device and tries to restore that >>>>> copy when released by the user, so anything that's part of the save >>>>> state and modified via side-channels while the device is opened, would >>>>> still be lost. >>>> >>>> >>>> Well it is but it isn't. The pci_restore_state will call >>>> pci_restore_iov_state which will just repopulate the data straight out >>>> of the pdev->sriov structure. I think the assumption is we were >>>> already carrying enough information that there wasn't any point in >>>> saving the state since we could just restore it from that structure >>>> anyway. >>> >>> >>> Ah, right, I remember that now. >>> >>>> >>>> I've been chasing it down and can verify that much this morning after >>>> testing. A regular hit of the sysfs reset control will not erase the >>>> state. >>> >>> >>> Perhaps long term, but momentarily the vf disappears and that could >>> imply data loss, no? >>> >>>> One issue that I think I found that may be a bug though is if I >>>> load the vfio-pci driver on an interface and unbind it I am seeing the >>>> port state reset. I have it chased down to the idle D3 code. It looks >>>> like going from vfio_pci_probe to vfio_pci_remove is triggering the >>>> equivilent of the pci_pm_reset since it is cycling me through >>>> D0->D3->D0 without restoring state after the fact. I also verified >>>> that setting disable_idle_d3 resolves the issue. Would you have any >>>> complaints about me doing a save_state in the probe call, and a >>>> restore_state in the remove? It seems like that should probably be the >>>> easiest fix. >>> >>> >>> Hmm, perhaps I had assumed that pci_set_power_state() would handle such >>> things, but indeed I do see other drivers calling pci_save_state() >>> prior and pci_restore_state() after. I think we'd need to audit all of >>> vfio-pci's calls to pci_set_power_state(), not just probe and remove. >>> >>>>>>> In the grand scheme of things how would the situation you describe be >>>>>>> any different than someone using the "reset" sysfs control on the PF >>>>>>> while using SR-IOV with driver supported SR-IOV? >>>>> >>>>> >>>>> Similar, but the pf driver has enabled sriov and can register the >>>>> reset_done error handler to be notified to re-enable sriov or perform >>>>> other cleanup. If the pf driver is not participating in sriov, it >>>>> would seem exceptional to support this. Triggering reset via sysfs >>>>> also falls more closely to shooting yourself in the foot than I think >>>>> we want to design into driver/subsystem APIs. >>>> >>>> >>>> To some extent I agree with the shooting yourself in the foot. At the >>>> same time there isn't actually all that much to re-enable SR-IOV. >>>> Restoring the PCI SR-IOV configuration space and re-enabling the bus >>>> master enable. >>> >>> >>> Bus master is a pretty interesting example of the trivial control a pf >>> driver can exert on the vfs though. >>> >>>> The Amazon guys would probably know better than I since I haven't >>>> really worked much with one of these parts yet. Actually the virtio >>>> that Mark pushed may behave the same way too. As far as I know in >>>> these firmware cases the hardware itself has everything >>>> pre-partitioned and set to re-enable as soon as the SR-IOV bits are >>>> set. I think all they need is a few bits toggled and they are pretty >>>> much good to go. >>> >>> >>> Are they just looking for an sriov capable stub driver? With >>> increasing vf counts, being able to use something like vfio-pci on the >>> pf seems like all risk with statistically insignificant increase in >>> density. On the other hand, if there's a userspace pf management >>> driver, why not just make it trusted by adding it as a native host >>> kernel driver? If we're talking about tainting the host kernel to >>> enable this interaction, maybe it should just be tainted by an out of >>> tree, possibly non-gpl host pf driver anyway. There can't really be a >>> case where the pf would be used by an average user without some degree >>> of privilege or cooperation, right? >>> >>>>>>> >>>>>>> I suppose if you really wanted we could add a new call that you could >>>>>>> put into the sriov_configure pointer that would just make it always >>>>>>> return error. Then that way the configuration could be locked until >>>>>>> the driver is unloaded. >>>>>>> >>>>>>>> >>>>>>>> I guess I don't understand how vfs can operate fully independent of >>>>>>>> the >>>>>>>> pf and why vfio-pci wouldn't just implement a dummy sriov_configure >>>>>>>> to >>>>>>>> avoid contending with such issues. >>>>>>> >>>>>>> >>>>>>> This approach isn't without risk, but it isn't as if it is really a >>>>>>> new risk we are introducing, and we could do further work to help >>>>>>> smooth out issues like this. Much of this depends on the design of the >>>>>>> device and the drivers involved. What is happening in this case is >>>>>>> that the VFs are managed outside of the kernel itself either by some >>>>>>> user-space entity or by a firmware entity. >>>>> >>>>> >>>>> But we can't simply split vfs and pfs as separate resources, the vf is >>>>> dependent on the pf and that implies a privilege hierarchy, if not at >>>>> least cooperation. The pf driver can intentionally or unintentionally >>>>> disconnect the vfs at any point in time, possibly leading to data loss >>>>> or at least denial of service for the vf consumer. I also don't trust >>>>> that there aren't numerous sriov implementations where the pf isn't >>>>> privileged to the point of being able to snoop data from the vf. So >>>>> what sort of usage models are we enabling for the vf? A firmware owned >>>>> vf subject to these conditions seems mostly pointless. >>>> >>>> >>>> In the case of these sort of parts the PF isn't really all that >>>> privileged. Usually the PF is just a VF with the SR-IOV capability >>>> hanging off of it. I suspect the only thing that might outright >>>> control anything would be the bus master enable bit. Everything else >>>> could be independent. The PF driver in such cases doesn't do much. It >>>> is basically just the host for the configuration space. >>> >>> >>> This is sounding more like an sriov capable stub driver. Certainly not >>> all pfs behave the way you describe above, many are significantly more >>> privileged and even if not, bus master is a pretty trivial control >>> point. So we probably need a driver that claims devices known to >>> behave this way, or a meta driver that bind via dynamic IDs or >>> driver_override. The proposal here sort of covertly turns anything >>> that's not an sriov driver into that stub driver with no guarantee that >>> the subverted driver is a willing or safe participant. >>> >>>>>> So I was thinking about this some more. In the case of vfio-pci things >>>>>> are a bit different since you are essentially taking a given device >>>>>> and handing it to a VM or userspace and it doesn't guarantee a >>>>>> communication between the two. >>>>> >>>>> >>>>> Doesn't guarantee communication or cooperation or even trust. >>>> >>>> >>>> Right, but at the same time I consider this to be a shoot yourself in >>>> the foot type scenario. If you are going to hand off your PF while VFs >>>> are active then you are asking for whatever repercussions you end up >>>> getting. I've added a warning and a TAINT_USER flag to my code at this >>>> point if you load an active PF in vfio, and I have added a function >>>> that locks the setting so it cannot be changed once we place a PF in >>>> the control of vfio-pci. >>>> >>>> The way I see it there are two scenarios. One where the PF is just a >>>> VF with an extra bit of SR-IOV configuration space floating off of it. >>>> The other is where we want to have SR-IOV enabled and have some third >>>> party managing the PF. The first one doesn't really care about the >>>> communication and would prefer that whatever owns the driver on the PF >>>> just ignore the SR-IOV portion of the configuration space. The second >>>> one actually cares and would want some sort of >>>> communication/cooperation but for now I don't see that as being the >>>> easiest thing to do so it might be best to just let it see a fixed >>>> number of VFs it just has to work with that. >>> >>> >>> There's no such thing as a read-only sriov capability in the spec, >>> which is a problem we ran into a while ago, vfio-pci exposes the >>> capability, but protects it as read-only so the user cannot create >>> devices on the host. QEMU passed this through to the guest, but that >>> failed as soon as OVMF started supporting sriov as it's unable to size >>> the VF BAR resources. Now QEMU drops the sriov capability from the >>> guest capability chain. So it's not clear how this fixed number of vfs >>> plan works. Are we inventing our own capability for this? If the pf >>> driver is just a userspace driver and not a VM, maybe what we do now is >>> sufficient, otherwise there's no standard for exposing fixed vfs. >>> >>>>>> My thought is to look at making SR-IOV configuration static or treat >>>>>> it as read-only when the vfio-pci driver is loaded on a given >>>>>> interface. In addition I would set the TAINT_USER flag and add a >>>>>> warning about loading vfio-pci on an active PF, and provide the number >>>>>> of VFs that were allocated. >>>>>> >>>>>> The idea with all of this being that we would at least have a partial >>>>>> lock on all of this so that you can allocate some number of VFs, and >>>>>> then start partitioning things up where you could assign the PF and >>>>>> VFs as needed to the various interfaces. Once the PF is assigned to >>>>>> the vfio-pci driver it would be locked in terms of the number of VFs >>>>>> that are provided so there would be no need for an extra communication >>>>>> channel between the PF driver and the host to deal with changes. >>>>>> >>>>>> Thoughts? >>>>> >>>>> >>>>> Is an sriov configuration ever static? vfio is the most prolific user >>>>> of the pci reset interfaces, AFAIK. Even if we add support for >>>>> restoring the sriov configuration and even if the pf user isn't trying >>>>> to be malicious, vf users would need to be prepared for their device >>>>> arbitrarily dropping off the bus, mmio randomly (from their >>>>> perspective) being disabled, and maybe other side-effects of the user >>>>> just trying to use the pf device. Is there even any guarantee that a >>>>> pf driver can operate identically, regardless of the state of sriov on >>>>> the pf? It seems like the pf driver needs to be aware of which queues >>>>> are available for itself vs the vfs in some designs. Is there still >>>>> long term value in a solution for this if the kernel is tainted? >>>> >>>> >>>> As far as the device randomly resetting I don't really see how that is >>>> any different from what we already have in terms of solutions >>>> including stuff supported by in-driver SR-IOV. The Intel drivers are >>>> notorious for resetting for any little thing like an MTU change. Us >>>> resetting due to someone calling open/release on an interface wouldn't >>>> be anything new. In addition AER can always hit us too. Yes we don't >>>> have the same recovery mechanisms in place, but in the case of >>>> something such as this we probably don't need that complex of a >>>> recovery setup. >>> >>> >>> [I think below paragraph is specifically answering last question above, >>> still long term value in kernel tainting solution] >>> >>>> I would think there probably is. If not we wouldn't have gotten the >>>> patches earlier that were still doing the tainting and warning and >>>> making use of the vfio-pci driver to enable SR-IOV. I suspect the use >>>> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and >>>> then assign VFs into and out of guests. I don't see the PF doing a lot >>>> of moving after it is setup. The taint flag only really applies if >>>> someone is looking for support and quite honestly I figure for now the >>>> USER flag is appropriate for this kind of thing since we are deferring >>>> all the networking control to the PF which is managed by userspace. >>> >>> >>> If we're going to throw up our hands and taint the kernel, then maybe >>> we could entertain the idea of doing that in a trivial vfio-pci >>> sriov_configure function. But does that really meet all the use cases >>> and what's the advantage of that vs Amazon (I guess they're the driver >>> here) tainting the kernel with their own driver? Mellanox has tried to >>> enable sriov in vfio-pci in the past, Cc Ilya. >>> >>>>> Minimally, it seems like the pf driver (the user driver, not vfio-pci) >>>>> needs to be a cooperating party in this exchange. How do we determine >>>>> that and what QoS guarantees are they providing by agreeing to it? >>>> >>>> >>>> I think the general idea here is that the user has already tested this >>>> out and determined for themselves that the user driver does what they >>>> want/need. If it didn't they wouldn't be going to all these lengths to >>>> set this up. >>>> >>>>> Otherwise it doesn't seem like sriov provides the sort of static, hard >>>>> partitioning of the device that you're asking for. A vf is a piece of >>>>> the pf and we've released ownership and control of the pf to a user. >>>>> Thanks, >>>>> >>>>> Alex >>>> >>>> >>>> I am pretty sure that you are describing is true of some, but not for >>>> all. I think the Amazon solutions and the virtio solution are doing >>>> hard partitioning of the part. I will leave it to those guys to speak >>>> for themselves since I don't know anything about the hardware design >>>> of those parts. >>> >>> >>> I think we'd need device specific knowledge and enablement to be able >>> to take advantage of any hardware partitioning, otherwise we need to >>> assume the pf is privileged, as implemented in other sriov devices. >>> >>> I'm also trying to imagine whether there's a solution via the new >>> vfio/mdev interface, where the mdev vendor driver would bind to the pf >>> and effectively present itself as the mdev device. The vendor driver >>> could provide sriov capabilities and bear the burden of ensuring that >>> the pf is used cooperatively. The only existing mdev vendor drivers are >>> vGPUs and rely on on-device DMA translation and isolation, such as >>> through GTTs, but there have been some thoughts on providing IOMMU based >>> isolation of mdev/sriov mixed devices (assuming DMA is even required >>> for userspace management of the pf in this use case). [Cc Kirti] >>> Thanks, >>> >>> Alex >>> >> Apologies for getting to this party late. >> >> My 2 cents: >> >> I think a stub driver is needed for security reasons. >> Multiple reasons/cases stated in the thread to (continue to) follow that >> model. >> ... and wondering why pci-stub isn't enhanced to do just that, since >> it's already used in device-assignment tasks (VM or DPDK). > > There was an attempt to update pci-stub: > https://patchwork.kernel.org/patch/10109935/ > > One issue with using pci-stub is it doesn't address the security > issues needed if an IOMMU is enabled. For that reason I think vfio-pci > might be a better choice, and that is why my v2 is based around it. > Sorry, not seeing why pci-stub has an IOMMU issue and vfio-pci does not, at least from a security point of view. vfio-pci is far-more savvy wrt IOMMU groups, and that can be viewed as a security check, so if that's what you mean, I'll buy that vowel. Otherwise, plse shine (more) light on yonder window ... > For my v2 of this patch set I dropped the pure generic solution and > instead just updated virtio and vfio-pci in order to meet the needs as > called out in the previous patch sets. > virtio is needed? ... More stuff I need to dig into... >> -- pf devices that have drivers that coordinate don't need pci-stub; >> pf devices w/no drivers get assigned to pci-stub, then VF's are >> enabled/disabled/tracked/managed.... > > Yeah, my first attempt at this was to avoid doing what I consider to > be somewhat ridiculous things like adding SR-IOV support to > virtio_net. It didn't seem right to me but in the grand scheme of > things that ends up kind of being the way we have to go in order to > put some boundaries on when/where SR-IOV can be enabled. > It seems to me we are bending backwards for an unprivileged, PF driver use case that should is the muck of doing all of this... IMO... :-p >> mdev wrapping of VFs doesn't (re)solve device-specific needs/extensions >> for VF migration either. >> PF-device-specific needs should be handled by a PF driver; >> if no PF driver, then assuming no device-specific need exists, and using >> non-device-specific, common pci-stub VF handling > > The issue here is how do you classify that there is no common PF > driver, or is this something that would be decided by the user? > pci bus code checks *after* full bus scan & driver-matching/load if a (PF) pci-dev has VF's (sriov cfg block), and no PF driver attached. if some (I don't agree should exist -- see IMO above) user-space PF wants to be used, unconfigure the vfio-pci driver from the PF first (another simple check). Not a priviledged user?.... ah, another reason _not_ to allow userspace PF. > Also I don't know if this addresses the needs fro the DPDK crowd to > enable a userspace PF driver? > and why do we suddenly think that an (unprivileged) userspace PF driver in the *host* (not a bounded/controlled guest VM) should be allowed to control a device? ... because DPDK is 'a good citizen'? bug free?!? Sorry, IMO, make a PF stub for VF assignment. >> -- even w/pci-quirk hooks/workarounds if they need to be added. >> Finally, if an ah-ha moment is reached that concludes with the >> need for a PF driver to handle device-specific issues, the pci-stub VF >> functions may provide the guide/framework for such hastily put together PF >> drivers, >> minimizing PF driver variances for sriov handling. > > I'm not sure I followed this part. Are you saying the pci-stub driver > gets used as a template essentially for creating a PF driver if > needed? That's what I was eluding to. Help those who think a PF stub driver for the kernel is so onerous. ... and no, I don't agree with the idea of a DPDK PF driver being asynchronously built and used wrt a kernel (version). IMO, that's another fault in that strategy. > >> Although a temporary disconnect is handled/emulated well in the (e)net >> space, >> I also don't think it's an issue for storage, as I would expect two >> situations to occur >> wrt data-loss scenarios: >> a) a PF device has a kernel driver, and it has to handle the up/down of a VF >> b) the PF is quiesced -- no device-specific kernel driver -- >> and there is no VF loss due to PF resetting b/c there >> is no need to reset PF -- pci-stub providing stable, kernel config. >> Per VF data loss is always an issue for storage interconnects >> (unless they are net-based, like iSCSI, RDMA, etc.) which is handled >> at the VF driver or layer above it. >> >> - Don > > My concern is we are starting to see a third option here. A PF that > has a kernel driver, but knows nothing of SR-IOV as it is completely > managed by firmware. An example being the virtio_net that somehow > knows how to do SR-IOV (https://patchwork.kernel.org/patch/10241225/). > My understanding is that the part in question is providing a > lightweight PF that is essentially just another VF, but has some extra > PCIe configuration space to it allowing for SR-IOV and AER > functionality. So, give it a light-wt PF driver. Maybe I'd appreciate this issue better if I understood how 'managed by firmware' is implemented. > > In addition we are also seeing customer scenarios where things like > DPDK are running as the PF with kernel VFs. Ideally we want to contain > that as much as possible and my understanding is that igb_uio > currently isn't really providing much in the way of containment. uh hum... q.e.d.! > > - Alex > IMO, the tail(DPDK; fw?) is wagging the dog(kernel) here. Either, we can add a light-wt, PF stub driver that does VF assignment for a conforming PF design (and just keep adding vid/did's to it over time), or that design is taken, tweaked and a new PF 'lightweight driver' is created to handle the hacking/workaround that needs to be done for a PF. 'Device-assignment' (properly) assumes the PF is in the control/host domain, and VFs are 'given' to a non-privileged, managed domain (managed by the control/host domain). I don't see how you maintain security if the PF isn't in the control/host domain.
On Tue, Mar 6, 2018 at 12:19 PM, Don Dutile <ddutile@redhat.com> wrote: > On 03/05/2018 04:41 PM, Alexander Duyck wrote: >> >> On Mon, Mar 5, 2018 at 12:57 PM, Don Dutile <ddutile@redhat.com> wrote: >>> >>> On 03/01/2018 03:22 PM, Alex Williamson wrote: >>>> >>>> >>>> On Wed, 28 Feb 2018 16:36:38 -0800 >>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>> >>>>> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson >>>>> <alex.williamson@redhat.com> wrote: >>>>>> >>>>>> >>>>>> On Wed, 28 Feb 2018 09:49:21 -0800 >>>>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>>>> >>>>>>> >>>>>>> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck >>>>>>> <alexander.duyck@gmail.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson >>>>>>>> <alex.williamson@redhat.com> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, 27 Feb 2018 11:06:54 -0800 >>>>>>>>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> From: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>>>>> >>>>>>>>>> This patch is meant to add support for SR-IOV on devices when the >>>>>>>>>> VFs are >>>>>>>>>> not managed by the kernel. Examples of recent patches attempting >>>>>>>>>> to >>>>>>>>>> do this >>>>>>>>>> include: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> It appears to enable sriov when the _pf_ is not managed by the >>>>>>>>> kernel, but by "managed" we mean that either there is no pf driver >>>>>>>>> or >>>>>>>>> the pf driver doesn't provide an sriov_configure callback, >>>>>>>>> intentionally or otherwise. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> virto - https://patchwork.kernel.org/patch/10241225/ >>>>>>>>>> pci-stub - https://patchwork.kernel.org/patch/10109935/ >>>>>>>>>> vfio - https://patchwork.kernel.org/patch/10103353/ >>>>>>>>>> uio - https://patchwork.kernel.org/patch/9974031/ >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> So is the goal to get around the issues with enabling sriov on each >>>>>>>>> of >>>>>>>>> the above drivers by doing it under the covers or are you really >>>>>>>>> just >>>>>>>>> trying to enable sriov for a truly unmanage (no pf driver) case? >>>>>>>>> For >>>>>>>>> example, should a driver explicitly not wanting sriov enabled >>>>>>>>> implement >>>>>>>>> a dummy sriov_configure function? >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Since this is quickly blowing up into a multi-driver problem it is >>>>>>>>>> probably >>>>>>>>>> best to implement this solution in one spot. >>>>>>>>>> >>>>>>>>>> This patch is an attempt to do that. What we do with this patch is >>>>>>>>>> provide >>>>>>>>>> a generic call to enable SR-IOV in the case that the PF driver is >>>>>>>>>> either >>>>>>>>>> not present, or the PF driver doesn't support configuring SR-IOV. >>>>>>>>>> >>>>>>>>>> A new sysfs value called sriov_unmanaged_autoprobe has been added. >>>>>>>>>> This >>>>>>>>>> value is used as the drivers_autoprobe setting of the VFs when >>>>>>>>>> they >>>>>>>>>> are >>>>>>>>>> being managed by an external entity such as userspace or device >>>>>>>>>> firmware >>>>>>>>>> instead of being managed by the kernel. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Documentation/ABI/testing/sysfs-bus-pci update is missing. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I can make sure to update that in the next version. >>>>>>>> >>>>>>>>>> >>>>>>>>>> One side effect of this change is that the sriov_drivers_autoprobe >>>>>>>>>> and >>>>>>>>>> sriov_unmanaged_autoprobe will only apply their updates when >>>>>>>>>> SR-IOV >>>>>>>>>> is >>>>>>>>>> disabled. Attempts to update them when SR-IOV is in use will only >>>>>>>>>> update >>>>>>>>>> the local value and will not update sriov->autoprobe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> And we expect users to understand when sriov_drivers_autoprobe >>>>>>>>> applies >>>>>>>>> vs sriov_unmanaged_autoprobe, even though they're using the same >>>>>>>>> interfaces to enable sriov? Are all combinations expected to work, >>>>>>>>> ex. >>>>>>>>> unmanaged sriov is enabled, a native pf driver loads, vfs work? >>>>>>>>> Not >>>>>>>>> only does it seems like there's opportunity to use this >>>>>>>>> incorrectly, >>>>>>>>> I >>>>>>>>> think maybe it might be difficult to use correctly. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> I based my patch set originally on the patch by Mark Rustad but >>>>>>>>>> there isn't >>>>>>>>>> much left after going through and cleaning out the bits that were >>>>>>>>>> no >>>>>>>>>> longer >>>>>>>>>> needed, and after incorporating the feedback from David Miller. >>>>>>>>>> >>>>>>>>>> I have included the authors of the original 4 patches above in the >>>>>>>>>> Cc here. >>>>>>>>>> My hope is to get feedback and/or review on if this works for >>>>>>>>>> their >>>>>>>>>> use >>>>>>>>>> cases. >>>>>>>>>> >>>>>>>>>> Cc: Mark Rustad <mark.d.rustad@intel.com> >>>>>>>>>> Cc: Maximilian Heyne <mheyne@amazon.de> >>>>>>>>>> Cc: Liang-Min Wang <liang-min.wang@intel.com> >>>>>>>>>> Cc: David Woodhouse <dwmw@amazon.co.uk> >>>>>>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>>>>>>>> --- >>>>>>>>>> drivers/pci/iov.c | 27 +++++++++++++++++++- >>>>>>>>>> drivers/pci/pci-driver.c | 2 + >>>>>>>>>> drivers/pci/pci-sysfs.c | 62 >>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++----- >>>>>>>>>> drivers/pci/pci.h | 4 ++- >>>>>>>>>> include/linux/pci.h | 1 + >>>>>>>>>> 5 files changed, 86 insertions(+), 10 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>>>>>>>>> index 677924ae0350..7b8858bd8d03 100644 >>>>>>>>>> --- a/drivers/pci/iov.c >>>>>>>>>> +++ b/drivers/pci/iov.c >>>>>>>>>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int >>>>>>>>>> pos) >>>>>>>>>> pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, >>>>>>>>>> &iov->vf_device); >>>>>>>>>> iov->pgsz = pgsz; >>>>>>>>>> iov->self = dev; >>>>>>>>>> + iov->autoprobe = true; >>>>>>>>>> iov->drivers_autoprobe = true; >>>>>>>>>> pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, >>>>>>>>>> &iov->cap); >>>>>>>>>> pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, >>>>>>>>>> &iov->link); >>>>>>>>>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev >>>>>>>>>> *dev) >>>>>>>>>> */ >>>>>>>>>> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool >>>>>>>>>> auto_probe) >>>>>>>>>> { >>>>>>>>>> - if (dev->is_physfn) >>>>>>>>>> + if (dev->is_physfn) { >>>>>>>>>> dev->sriov->drivers_autoprobe = auto_probe; >>>>>>>>>> + if (!dev->sriov->num_VFs) >>>>>>>>>> + dev->sriov->autoprobe = auto_probe; >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Why is dev->sriov->autoprobe set any time other than immediately >>>>>>>>> prior >>>>>>>>> to enabling VFs? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> My concern here was drivers that are still floating around with the >>>>>>>> old module parameter option for enabling SR-IOV. In the unlikely >>>>>>>> event >>>>>>>> that somebody was to use such a driver I wanted to make certain that >>>>>>>> the drivers_autoprobe state was pre-populated. >>>>>> >>>>>> >>>>>> >>>>>> Good point, but maybe that just means we should be setting it in >>>>>> sriov_enable()? >>>>> >>>>> >>>>> >>>>> I suppose we could. I would just have to check and see if we have any >>>>> drivers lurking out there that are supporting SR-IOV without >>>>> supporting the sysfs approach. As long as that is the case we could >>>>> probably put it there. >>>>> >>>>>>>>>> + } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> /** >>>>>>>>>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) >>>>>>>>>> EXPORT_SYMBOL_GPL(pci_disable_sriov); >>>>>>>>>> >>>>>>>>>> /** >>>>>>>>>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged >>>>>>>>>> SR-IOV >>>>>>>>>> + * @dev: the PCI device >>>>>>>>>> + * @nr_virtfn: number of virtual functions to enable, 0 to >>>>>>>>>> disable >>>>>>>>>> + * >>>>>>>>>> + * Used to provide generic enable/disable SR-IOV option for >>>>>>>>>> devices >>>>>>>>>> + * that do not manage the VFs generated by their driver, or have >>>>>>>>>> no >>>>>>>>>> + * driver present. >>>>>>>>>> + */ >>>>>>>>>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int >>>>>>>>>> nr_virtfn) >>>>>>>>>> +{ >>>>>>>>>> + int rc = 0; >>>>>>>>>> + >>>>>>>>>> + if (!nr_virtfn) >>>>>>>>>> + pci_disable_sriov(dev); >>>>>>>>>> + else >>>>>>>>>> + rc = pci_enable_sriov(dev, nr_virtfn); >>>>>>>>>> + >>>>>>>>>> + return rc ? rc : nr_virtfn; >>>>>>>>>> +} >>>>>>>>>> + >>>>>>>>>> +/** >>>>>>>>>> * pci_num_vf - return number of VFs associated with a PF >>>>>>>>>> device_release_driver >>>>>>>>>> * @dev: the PCI device >>>>>>>>>> * >>>>>>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >>>>>>>>>> index 3bed6beda051..2cc68dff6130 100644 >>>>>>>>>> --- a/drivers/pci/pci-driver.c >>>>>>>>>> +++ b/drivers/pci/pci-driver.c >>>>>>>>>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev >>>>>>>>>> *dev) >>>>>>>>>> #ifdef CONFIG_PCI_IOV >>>>>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>>>>> { >>>>>>>>>> - return (!pdev->is_virtfn || >>>>>>>>>> pdev->physfn->sriov->drivers_autoprobe); >>>>>>>>>> + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); >>>>>>>>>> } >>>>>>>>>> #else >>>>>>>>>> static inline bool pci_device_can_probe(struct pci_dev *pdev) >>>>>>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >>>>>>>>>> index eb6bee8724cc..e701b6dbc267 100644 >>>>>>>>>> --- a/drivers/pci/pci-sysfs.c >>>>>>>>>> +++ b/drivers/pci/pci-sysfs.c >>>>>>>>>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct >>>>>>>>>> device >>>>>>>>>> *dev, >>>>>>>>>> struct device_attribute *attr, >>>>>>>>>> const char *buf, size_t count) >>>>>>>>>> { >>>>>>>>>> + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); >>>>>>>>>> struct pci_dev *pdev = to_pci_dev(dev); >>>>>>>>>> int ret; >>>>>>>>>> u16 num_vfs; >>>>>>>>>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct >>>>>>>>>> device *dev, >>>>>>>>>> goto exit; >>>>>>>>>> >>>>>>>>>> /* is PF driver loaded w/callback */ >>>>>>>>>> - if (!pdev->driver || !pdev->driver->sriov_configure) { >>>>>>>>>> - pci_info(pdev, "Driver doesn't support SRIOV >>>>>>>>>> configuration via sysfs\n"); >>>>>>>>>> - ret = -ENOENT; >>>>>>>>>> - goto exit; >>>>>>>>>> - } >>>>>>>>>> + if (pdev->driver && pdev->driver->sriov_configure) >>>>>>>>>> + sriov_configure = pdev->driver->sriov_configure; >>>>>>>>>> + else >>>>>>>>>> + sriov_configure = pci_sriov_configure_unmanaged; >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> So an unwitting user is now able to enable vfs, independent of the >>>>>>>>> pf... the trouble being that they're probably going to expect them >>>>>>>>> to >>>>>>>>> work and the more common case is that they won't. For instance, >>>>>>>>> what >>>>>>>>> can you do with an igbvf when igb isn't managing the pf? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Well the VFs wouldn't be able to do anything. Basically they would >>>>>>>> be >>>>>>>> sitting there with no driver loaded on them unless they are assigned >>>>>>>> to a guest, or the root user had enabled the unmanaged option. If >>>>>>>> you >>>>>>>> did load a driver on it the VF would sit there with link down unless >>>>>>>> either the PF driver is loaded or some user-space entity steps in to >>>>>>>> start managing the PF. >>>>>>>> >>>>>>>> In reality this can already happen as last I recall igb and ixgbe >>>>>>>> were >>>>>>>> already capable of having the PF driver removed when SR-IOV was >>>>>>>> enabled and VFs were assigned. Basically the VFs just report link >>>>>>>> down >>>>>>>> and don't do anything. Reloading the PF driver would have it take >>>>>>>> over >>>>>>>> in the case of igb and ixgbe since they were designed to handle that >>>>>>>> type of scenario. >>>>>> >>>>>> >>>>>> >>>>>> I think only ixgbe behaves this way between the two, igb disables >>>>>> sriov >>>>>> in its remove function unless we're hung up by that silly >>>>>> pci_vfs_assigned() check, which doesn't apply to vfio assignment. >>>>>> Regardless, yes, some pf drivers do leave sriov enabled on remove, >>>>>> whether that's useful or reasonable is another question. >>>>> >>>>> >>>>> >>>>> Right. We can argue that another day. I was just sighting the igb >>>>> behavior. >>>>> >>>>>>>>> Or what happens when vfio-pci owns the pf, sriov is enabled via the >>>>>>>>> unmanaged interface, and the pf user driver segfaults and gets >>>>>>>>> killed, >>>>>>>>> causing vfio-pci to restore the pf state, including wiping the >>>>>>>>> sriov >>>>>>>>> config? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Wiping the config shouldn't have any effect on the allocated VF pci >>>>>>>> devices. It will cause the VFs to act as though they have fallen off >>>>>>>> of the bus though and the guests would see a surprise remove type >>>>>>>> behavior if I am not mistaken. The MMIO and config accesses would >>>>>>>> fail >>>>>>>> until the SR-IOV configuration is restored. Really this shouldn't be >>>>>>>> a >>>>>>>> problem as long as the SR-IOV is enabled prior to loading the >>>>>>>> vfio-pci >>>>>>>> driver as I would assume it would restore the state an re-enable >>>>>>>> SR-IOV. >>>>>> >>>>>> >>>>>> >>>>>> Nope, sriov does not appear to be part of the device state saved and >>>>>> restored around reset, so I think we just end up in an inconsistent >>>>>> state. Also, vfio-pci stores a copy of the saved state of a device >>>>>> prior to giving the user access to the device and tries to restore >>>>>> that >>>>>> copy when released by the user, so anything that's part of the save >>>>>> state and modified via side-channels while the device is opened, would >>>>>> still be lost. >>>>> >>>>> >>>>> >>>>> Well it is but it isn't. The pci_restore_state will call >>>>> pci_restore_iov_state which will just repopulate the data straight out >>>>> of the pdev->sriov structure. I think the assumption is we were >>>>> already carrying enough information that there wasn't any point in >>>>> saving the state since we could just restore it from that structure >>>>> anyway. >>>> >>>> >>>> >>>> Ah, right, I remember that now. >>>> >>>>> >>>>> I've been chasing it down and can verify that much this morning after >>>>> testing. A regular hit of the sysfs reset control will not erase the >>>>> state. >>>> >>>> >>>> >>>> Perhaps long term, but momentarily the vf disappears and that could >>>> imply data loss, no? >>>> >>>>> One issue that I think I found that may be a bug though is if I >>>>> load the vfio-pci driver on an interface and unbind it I am seeing the >>>>> port state reset. I have it chased down to the idle D3 code. It looks >>>>> like going from vfio_pci_probe to vfio_pci_remove is triggering the >>>>> equivilent of the pci_pm_reset since it is cycling me through >>>>> D0->D3->D0 without restoring state after the fact. I also verified >>>>> that setting disable_idle_d3 resolves the issue. Would you have any >>>>> complaints about me doing a save_state in the probe call, and a >>>>> restore_state in the remove? It seems like that should probably be the >>>>> easiest fix. >>>> >>>> >>>> >>>> Hmm, perhaps I had assumed that pci_set_power_state() would handle such >>>> things, but indeed I do see other drivers calling pci_save_state() >>>> prior and pci_restore_state() after. I think we'd need to audit all of >>>> vfio-pci's calls to pci_set_power_state(), not just probe and remove. >>>> >>>>>>>> In the grand scheme of things how would the situation you describe >>>>>>>> be >>>>>>>> any different than someone using the "reset" sysfs control on the PF >>>>>>>> while using SR-IOV with driver supported SR-IOV? >>>>>> >>>>>> >>>>>> >>>>>> Similar, but the pf driver has enabled sriov and can register the >>>>>> reset_done error handler to be notified to re-enable sriov or perform >>>>>> other cleanup. If the pf driver is not participating in sriov, it >>>>>> would seem exceptional to support this. Triggering reset via sysfs >>>>>> also falls more closely to shooting yourself in the foot than I think >>>>>> we want to design into driver/subsystem APIs. >>>>> >>>>> >>>>> >>>>> To some extent I agree with the shooting yourself in the foot. At the >>>>> same time there isn't actually all that much to re-enable SR-IOV. >>>>> Restoring the PCI SR-IOV configuration space and re-enabling the bus >>>>> master enable. >>>> >>>> >>>> >>>> Bus master is a pretty interesting example of the trivial control a pf >>>> driver can exert on the vfs though. >>>> >>>>> The Amazon guys would probably know better than I since I haven't >>>>> really worked much with one of these parts yet. Actually the virtio >>>>> that Mark pushed may behave the same way too. As far as I know in >>>>> these firmware cases the hardware itself has everything >>>>> pre-partitioned and set to re-enable as soon as the SR-IOV bits are >>>>> set. I think all they need is a few bits toggled and they are pretty >>>>> much good to go. >>>> >>>> >>>> >>>> Are they just looking for an sriov capable stub driver? With >>>> increasing vf counts, being able to use something like vfio-pci on the >>>> pf seems like all risk with statistically insignificant increase in >>>> density. On the other hand, if there's a userspace pf management >>>> driver, why not just make it trusted by adding it as a native host >>>> kernel driver? If we're talking about tainting the host kernel to >>>> enable this interaction, maybe it should just be tainted by an out of >>>> tree, possibly non-gpl host pf driver anyway. There can't really be a >>>> case where the pf would be used by an average user without some degree >>>> of privilege or cooperation, right? >>>> >>>>>>>> >>>>>>>> I suppose if you really wanted we could add a new call that you >>>>>>>> could >>>>>>>> put into the sriov_configure pointer that would just make it always >>>>>>>> return error. Then that way the configuration could be locked until >>>>>>>> the driver is unloaded. >>>>>>>> >>>>>>>>> >>>>>>>>> I guess I don't understand how vfs can operate fully independent of >>>>>>>>> the >>>>>>>>> pf and why vfio-pci wouldn't just implement a dummy sriov_configure >>>>>>>>> to >>>>>>>>> avoid contending with such issues. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> This approach isn't without risk, but it isn't as if it is really a >>>>>>>> new risk we are introducing, and we could do further work to help >>>>>>>> smooth out issues like this. Much of this depends on the design of >>>>>>>> the >>>>>>>> device and the drivers involved. What is happening in this case is >>>>>>>> that the VFs are managed outside of the kernel itself either by some >>>>>>>> user-space entity or by a firmware entity. >>>>>> >>>>>> >>>>>> >>>>>> But we can't simply split vfs and pfs as separate resources, the vf is >>>>>> dependent on the pf and that implies a privilege hierarchy, if not at >>>>>> least cooperation. The pf driver can intentionally or unintentionally >>>>>> disconnect the vfs at any point in time, possibly leading to data loss >>>>>> or at least denial of service for the vf consumer. I also don't trust >>>>>> that there aren't numerous sriov implementations where the pf isn't >>>>>> privileged to the point of being able to snoop data from the vf. So >>>>>> what sort of usage models are we enabling for the vf? A firmware >>>>>> owned >>>>>> vf subject to these conditions seems mostly pointless. >>>>> >>>>> >>>>> >>>>> In the case of these sort of parts the PF isn't really all that >>>>> privileged. Usually the PF is just a VF with the SR-IOV capability >>>>> hanging off of it. I suspect the only thing that might outright >>>>> control anything would be the bus master enable bit. Everything else >>>>> could be independent. The PF driver in such cases doesn't do much. It >>>>> is basically just the host for the configuration space. >>>> >>>> >>>> >>>> This is sounding more like an sriov capable stub driver. Certainly not >>>> all pfs behave the way you describe above, many are significantly more >>>> privileged and even if not, bus master is a pretty trivial control >>>> point. So we probably need a driver that claims devices known to >>>> behave this way, or a meta driver that bind via dynamic IDs or >>>> driver_override. The proposal here sort of covertly turns anything >>>> that's not an sriov driver into that stub driver with no guarantee that >>>> the subverted driver is a willing or safe participant. >>>> >>>>>>> So I was thinking about this some more. In the case of vfio-pci >>>>>>> things >>>>>>> are a bit different since you are essentially taking a given device >>>>>>> and handing it to a VM or userspace and it doesn't guarantee a >>>>>>> communication between the two. >>>>>> >>>>>> >>>>>> >>>>>> Doesn't guarantee communication or cooperation or even trust. >>>>> >>>>> >>>>> >>>>> Right, but at the same time I consider this to be a shoot yourself in >>>>> the foot type scenario. If you are going to hand off your PF while VFs >>>>> are active then you are asking for whatever repercussions you end up >>>>> getting. I've added a warning and a TAINT_USER flag to my code at this >>>>> point if you load an active PF in vfio, and I have added a function >>>>> that locks the setting so it cannot be changed once we place a PF in >>>>> the control of vfio-pci. >>>>> >>>>> The way I see it there are two scenarios. One where the PF is just a >>>>> VF with an extra bit of SR-IOV configuration space floating off of it. >>>>> The other is where we want to have SR-IOV enabled and have some third >>>>> party managing the PF. The first one doesn't really care about the >>>>> communication and would prefer that whatever owns the driver on the PF >>>>> just ignore the SR-IOV portion of the configuration space. The second >>>>> one actually cares and would want some sort of >>>>> communication/cooperation but for now I don't see that as being the >>>>> easiest thing to do so it might be best to just let it see a fixed >>>>> number of VFs it just has to work with that. >>>> >>>> >>>> >>>> There's no such thing as a read-only sriov capability in the spec, >>>> which is a problem we ran into a while ago, vfio-pci exposes the >>>> capability, but protects it as read-only so the user cannot create >>>> devices on the host. QEMU passed this through to the guest, but that >>>> failed as soon as OVMF started supporting sriov as it's unable to size >>>> the VF BAR resources. Now QEMU drops the sriov capability from the >>>> guest capability chain. So it's not clear how this fixed number of vfs >>>> plan works. Are we inventing our own capability for this? If the pf >>>> driver is just a userspace driver and not a VM, maybe what we do now is >>>> sufficient, otherwise there's no standard for exposing fixed vfs. >>>> >>>>>>> My thought is to look at making SR-IOV configuration static or treat >>>>>>> it as read-only when the vfio-pci driver is loaded on a given >>>>>>> interface. In addition I would set the TAINT_USER flag and add a >>>>>>> warning about loading vfio-pci on an active PF, and provide the >>>>>>> number >>>>>>> of VFs that were allocated. >>>>>>> >>>>>>> The idea with all of this being that we would at least have a partial >>>>>>> lock on all of this so that you can allocate some number of VFs, and >>>>>>> then start partitioning things up where you could assign the PF and >>>>>>> VFs as needed to the various interfaces. Once the PF is assigned to >>>>>>> the vfio-pci driver it would be locked in terms of the number of VFs >>>>>>> that are provided so there would be no need for an extra >>>>>>> communication >>>>>>> channel between the PF driver and the host to deal with changes. >>>>>>> >>>>>>> Thoughts? >>>>>> >>>>>> >>>>>> >>>>>> Is an sriov configuration ever static? vfio is the most prolific user >>>>>> of the pci reset interfaces, AFAIK. Even if we add support for >>>>>> restoring the sriov configuration and even if the pf user isn't trying >>>>>> to be malicious, vf users would need to be prepared for their device >>>>>> arbitrarily dropping off the bus, mmio randomly (from their >>>>>> perspective) being disabled, and maybe other side-effects of the user >>>>>> just trying to use the pf device. Is there even any guarantee that a >>>>>> pf driver can operate identically, regardless of the state of sriov on >>>>>> the pf? It seems like the pf driver needs to be aware of which queues >>>>>> are available for itself vs the vfs in some designs. Is there still >>>>>> long term value in a solution for this if the kernel is tainted? >>>>> >>>>> >>>>> >>>>> As far as the device randomly resetting I don't really see how that is >>>>> any different from what we already have in terms of solutions >>>>> including stuff supported by in-driver SR-IOV. The Intel drivers are >>>>> notorious for resetting for any little thing like an MTU change. Us >>>>> resetting due to someone calling open/release on an interface wouldn't >>>>> be anything new. In addition AER can always hit us too. Yes we don't >>>>> have the same recovery mechanisms in place, but in the case of >>>>> something such as this we probably don't need that complex of a >>>>> recovery setup. >>>> >>>> >>>> >>>> [I think below paragraph is specifically answering last question above, >>>> still long term value in kernel tainting solution] >>>> >>>>> I would think there probably is. If not we wouldn't have gotten the >>>>> patches earlier that were still doing the tainting and warning and >>>>> making use of the vfio-pci driver to enable SR-IOV. I suspect the use >>>>> case for all this is to enable SR-IOV, setup the PF in vfio-pci, and >>>>> then assign VFs into and out of guests. I don't see the PF doing a lot >>>>> of moving after it is setup. The taint flag only really applies if >>>>> someone is looking for support and quite honestly I figure for now the >>>>> USER flag is appropriate for this kind of thing since we are deferring >>>>> all the networking control to the PF which is managed by userspace. >>>> >>>> >>>> >>>> If we're going to throw up our hands and taint the kernel, then maybe >>>> we could entertain the idea of doing that in a trivial vfio-pci >>>> sriov_configure function. But does that really meet all the use cases >>>> and what's the advantage of that vs Amazon (I guess they're the driver >>>> here) tainting the kernel with their own driver? Mellanox has tried to >>>> enable sriov in vfio-pci in the past, Cc Ilya. >>>> >>>>>> Minimally, it seems like the pf driver (the user driver, not vfio-pci) >>>>>> needs to be a cooperating party in this exchange. How do we determine >>>>>> that and what QoS guarantees are they providing by agreeing to it? >>>>> >>>>> >>>>> >>>>> I think the general idea here is that the user has already tested this >>>>> out and determined for themselves that the user driver does what they >>>>> want/need. If it didn't they wouldn't be going to all these lengths to >>>>> set this up. >>>>> >>>>>> Otherwise it doesn't seem like sriov provides the sort of static, hard >>>>>> partitioning of the device that you're asking for. A vf is a piece of >>>>>> the pf and we've released ownership and control of the pf to a user. >>>>>> Thanks, >>>>>> >>>>>> Alex >>>>> >>>>> >>>>> >>>>> I am pretty sure that you are describing is true of some, but not for >>>>> all. I think the Amazon solutions and the virtio solution are doing >>>>> hard partitioning of the part. I will leave it to those guys to speak >>>>> for themselves since I don't know anything about the hardware design >>>>> of those parts. >>>> >>>> >>>> >>>> I think we'd need device specific knowledge and enablement to be able >>>> to take advantage of any hardware partitioning, otherwise we need to >>>> assume the pf is privileged, as implemented in other sriov devices. >>>> >>>> I'm also trying to imagine whether there's a solution via the new >>>> vfio/mdev interface, where the mdev vendor driver would bind to the pf >>>> and effectively present itself as the mdev device. The vendor driver >>>> could provide sriov capabilities and bear the burden of ensuring that >>>> the pf is used cooperatively. The only existing mdev vendor drivers are >>>> vGPUs and rely on on-device DMA translation and isolation, such as >>>> through GTTs, but there have been some thoughts on providing IOMMU based >>>> isolation of mdev/sriov mixed devices (assuming DMA is even required >>>> for userspace management of the pf in this use case). [Cc Kirti] >>>> Thanks, >>>> >>>> Alex >>>> >>> Apologies for getting to this party late. >>> >>> My 2 cents: >>> >>> I think a stub driver is needed for security reasons. >>> Multiple reasons/cases stated in the thread to (continue to) follow that >>> model. >>> ... and wondering why pci-stub isn't enhanced to do just that, since >>> it's already used in device-assignment tasks (VM or DPDK). >> >> >> There was an attempt to update pci-stub: >> https://patchwork.kernel.org/patch/10109935/ >> > > >> One issue with using pci-stub is it doesn't address the security >> issues needed if an IOMMU is enabled. For that reason I think vfio-pci >> might be a better choice, and that is why my v2 is based around it. >> > Sorry, not seeing why pci-stub has an IOMMU issue and vfio-pci does not, > at least from a security point of view. > vfio-pci is far-more savvy wrt IOMMU groups, and that can be viewed as a > security > check, so if that's what you mean, I'll buy that vowel. > Otherwise, plse shine (more) light on yonder window ... That is pretty much what I have in mind. Basically the vfio ends up being much more savvy in this regard and so it becomes the obvious choice for something like a user space PF driver. >> For my v2 of this patch set I dropped the pure generic solution and >> instead just updated virtio and vfio-pci in order to meet the needs as >> called out in the previous patch sets. >> > virtio is needed? ... > More stuff I need to dig into... Yeah. I believe that is patch 3 in the set is the one that enables virtio as we have a part that emulates virtio_net for the PF and VFs, but the PF interface is exposing the necessary SR-IOV PCIe configuration bits. >>> -- pf devices that have drivers that coordinate don't need pci-stub; >>> pf devices w/no drivers get assigned to pci-stub, then VF's are >>> enabled/disabled/tracked/managed.... >> >> >> Yeah, my first attempt at this was to avoid doing what I consider to >> be somewhat ridiculous things like adding SR-IOV support to >> virtio_net. It didn't seem right to me but in the grand scheme of >> things that ends up kind of being the way we have to go in order to >> put some boundaries on when/where SR-IOV can be enabled. >> > It seems to me we are bending backwards for an unprivileged, PF driver use > case > that should is the muck of doing all of this... IMO... :-p I think I understand what you are trying to say here. I'm beginning to think I may back away from the PF driver use case, and instead just focus on the virtio one for now. >>> mdev wrapping of VFs doesn't (re)solve device-specific needs/extensions >>> for VF migration either. >>> PF-device-specific needs should be handled by a PF driver; >>> if no PF driver, then assuming no device-specific need exists, and using >>> non-device-specific, common pci-stub VF handling >> >> >> The issue here is how do you classify that there is no common PF >> driver, or is this something that would be decided by the user? >> > pci bus code checks *after* full bus scan & driver-matching/load > if a (PF) pci-dev has VF's (sriov cfg block), and no PF driver attached. > if some (I don't agree should exist -- see IMO above) user-space PF wants > to be used, unconfigure the vfio-pci driver from the PF first (another > simple check). > Not a priviledged user?.... ah, another reason _not_ to allow userspace PF. The privileged user part isn't an issue I think, and there are already solutions out there doing this via igb_uio. >> Also I don't know if this addresses the needs fro the DPDK crowd to >> enable a userspace PF driver? >> > and why do we suddenly think that an (unprivileged) userspace PF driver in > the > *host* (not a bounded/controlled guest VM) should be allowed to control a > device? > ... because DPDK is 'a good citizen'? bug free?!? > Sorry, IMO, make a PF stub for VF assignment. I'm not arguing that is the case. That is one of the reasons why we taint the kernel as soon as the SR-IOV VFs are allocated. The fact is a userspace PF is bound to cause issues one way or another simply because it adds that much more complexity to the test matrix. In the case of the virtio device I know the device IDs which is why we did the stub for that since it is a firmware managed device. Are you arguing we need to fork VFIO and create a new PF stub driver based on that to support the SR-IOV use case? >>> -- even w/pci-quirk hooks/workarounds if they need to be added. >>> Finally, if an ah-ha moment is reached that concludes with the >>> need for a PF driver to handle device-specific issues, the pci-stub VF >>> functions may provide the guide/framework for such hastily put together >>> PF >>> drivers, >>> minimizing PF driver variances for sriov handling. >> >> >> I'm not sure I followed this part. Are you saying the pci-stub driver >> gets used as a template essentially for creating a PF driver if >> needed? > > That's what I was eluding to. Help those who think a PF stub driver for the > kernel is so onerous. > ... and no, I don't agree with the idea of a DPDK PF driver being > asynchronously built > and used wrt a kernel (version). IMO, that's another fault in that strategy. I think part of the issue is the fact that some of these designs are probably too flexible for their own good. In the example of the part that needs SR-IOV support on virtio_net the issue is that the PF is just another partition of the same logic that is supporting the VFs. As such whatever they decide to expose as a VF appears as a PF. I would imagine that could be problematic if the number of options gets to be wide enough. >> >>> Although a temporary disconnect is handled/emulated well in the (e)net >>> space, >>> I also don't think it's an issue for storage, as I would expect two >>> situations to occur >>> wrt data-loss scenarios: >>> a) a PF device has a kernel driver, and it has to handle the up/down of a >>> VF >>> b) the PF is quiesced -- no device-specific kernel driver -- >>> and there is no VF loss due to PF resetting b/c there >>> is no need to reset PF -- pci-stub providing stable, kernel config. >>> Per VF data loss is always an issue for storage interconnects >>> (unless they are net-based, like iSCSI, RDMA, etc.) which is handled >>> at the VF driver or layer above it. >>> >>> - Don >> >> >> My concern is we are starting to see a third option here. A PF that >> has a kernel driver, but knows nothing of SR-IOV as it is completely >> managed by firmware. An example being the virtio_net that somehow >> knows how to do SR-IOV (https://patchwork.kernel.org/patch/10241225/). >> My understanding is that the part in question is providing a >> lightweight PF that is essentially just another VF, but has some extra >> PCIe configuration space to it allowing for SR-IOV and AER >> functionality. > > So, give it a light-wt PF driver. > Maybe I'd appreciate this issue better if I understood how 'managed by > firmware' is implemented. So if you take a look at my recent v3 patch set patch 2 includes the patch for virtio_net. Basically all the PF has is a virtio_net interface with an extended PCIe configuration space that includes the necessary support for AER and SR-IOV. It is my understanding that there is no way for the PF MMIO space to configure anything related to the VFs since nothing like that exists in virtio_net. >> >> In addition we are also seeing customer scenarios where things like >> DPDK are running as the PF with kernel VFs. Ideally we want to contain >> that as much as possible and my understanding is that igb_uio >> currently isn't really providing much in the way of containment. > > uh hum... q.e.d.! > >> >> - Alex >> > > IMO, the tail(DPDK; fw?) is wagging the dog(kernel) here. I agree. Part of the reason for this patch set is to keep the discussion on this going since it had kind of died off and after 4+ different attempts at getting something like this I thought I would try to take it on myself since I have been involved in at least 2 of the efforts now. > Either, we can add a light-wt, PF stub driver that does VF assignment > for a conforming PF design (and just keep adding vid/did's to it over time), > or that design is taken, tweaked and a new PF 'lightweight driver' is > created > to handle the hacking/workaround that needs to be done for a PF. That is what I started doing in v2. Basically for v2/v3 of this patch set I had limited it down to virtio and vfio-pci. I'm beginning to thing for v4 I might just drop the vfio-pci bits for now. If the Amazon guys have some device IDs for their devices I could look at putting together the suggested pci-stub-pf for them if needed. Otherwise I will probably just drop this to two patches and they can do it themselves later. > 'Device-assignment' (properly) assumes the PF is in the control/host domain, > and VFs are 'given' to a non-privileged, managed domain (managed by the > control/host domain). > I don't see how you maintain security if the PF isn't in the control/host > domain. So what ends up happening is that we have a privileged process that is necessary for the guests running with assigned VFs to function. The problem as I understand it is we don't really have anything here to enforce the policy needed in order to get security. Instead we are relying on the user to configure things correctly, and we cannot rely on that. I can see what we can do, but of course part of the issue is that when a SIGKILL is all that is needed to end the process it makes coordination of things a pretty big challenge.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 677924ae0350..7b8858bd8d03 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos) pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device); iov->pgsz = pgsz; iov->self = dev; + iov->autoprobe = true; iov->drivers_autoprobe = true; pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap); pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link); @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev) */ void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe) { - if (dev->is_physfn) + if (dev->is_physfn) { dev->sriov->drivers_autoprobe = auto_probe; + if (!dev->sriov->num_VFs) + dev->sriov->autoprobe = auto_probe; + } } /** @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev) EXPORT_SYMBOL_GPL(pci_disable_sriov); /** + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV + * @dev: the PCI device + * @nr_virtfn: number of virtual functions to enable, 0 to disable + * + * Used to provide generic enable/disable SR-IOV option for devices + * that do not manage the VFs generated by their driver, or have no + * driver present. + */ +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn) +{ + int rc = 0; + + if (!nr_virtfn) + pci_disable_sriov(dev); + else + rc = pci_enable_sriov(dev, nr_virtfn); + + return rc ? rc : nr_virtfn; +} + +/** * pci_num_vf - return number of VFs associated with a PF device_release_driver * @dev: the PCI device * diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 3bed6beda051..2cc68dff6130 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev) #ifdef CONFIG_PCI_IOV static inline bool pci_device_can_probe(struct pci_dev *pdev) { - return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe); + return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe); } #else static inline bool pci_device_can_probe(struct pci_dev *pdev) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index eb6bee8724cc..e701b6dbc267 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { + int (*sriov_configure)(struct pci_dev *dev, int num_vfs); struct pci_dev *pdev = to_pci_dev(dev); int ret; u16 num_vfs; @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev, goto exit; /* is PF driver loaded w/callback */ - if (!pdev->driver || !pdev->driver->sriov_configure) { - pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n"); - ret = -ENOENT; - goto exit; - } + if (pdev->driver && pdev->driver->sriov_configure) + sriov_configure = pdev->driver->sriov_configure; + else + sriov_configure = pci_sriov_configure_unmanaged; if (num_vfs == 0) { /* disable VFs */ - ret = pdev->driver->sriov_configure(pdev, 0); + ret = sriov_configure(pdev, 0); + + /* + * Fall back to drivers_autoprobe in case legacy driver + * decides to enable SR-IOV on load. + */ + pdev->sriov->autoprobe = pdev->sriov->drivers_autoprobe; goto exit; } @@ -642,7 +648,14 @@ static ssize_t sriov_numvfs_store(struct device *dev, goto exit; } - ret = pdev->driver->sriov_configure(pdev, num_vfs); + /* + * Update autoprobe based on unmanaged_autoprobe settings if PF + * driver is not managing the SR-IOV configuration for this device. + */ + if (!pdev->driver || !pdev->driver->sriov_configure) + pdev->sriov->autoprobe = pdev->sriov->unmanaged_autoprobe; + + ret = sriov_configure(pdev, num_vfs); if (ret < 0) goto exit; @@ -705,7 +718,37 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev, if (kstrtobool(buf, &drivers_autoprobe) < 0) return -EINVAL; + device_lock(&pdev->dev); + pdev->sriov->drivers_autoprobe = drivers_autoprobe; + if (!pdev->sriov->num_VFs) + pdev->sriov->autoprobe = drivers_autoprobe; + + device_unlock(&pdev->dev); + + return count; +} + +static ssize_t sriov_unmanaged_autoprobe_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%u\n", pdev->sriov->unmanaged_autoprobe); +} + +static ssize_t sriov_unmanaged_autoprobe_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct pci_dev *pdev = to_pci_dev(dev); + bool unmanaged_autoprobe; + + if (kstrtobool(buf, &unmanaged_autoprobe) < 0) + return -EINVAL; + + pdev->sriov->unmanaged_autoprobe = unmanaged_autoprobe; return count; } @@ -720,6 +763,10 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev, static struct device_attribute sriov_drivers_autoprobe_attr = __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store); +static struct device_attribute sriov_unmanaged_autoprobe_attr = + __ATTR(sriov_unmanaged_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), + sriov_unmanaged_autoprobe_show, + sriov_unmanaged_autoprobe_store); #endif /* CONFIG_PCI_IOV */ static ssize_t driver_override_store(struct device *dev, @@ -1789,6 +1836,7 @@ static umode_t pcie_dev_attrs_are_visible(struct kobject *kobj, &sriov_stride_attr.attr, &sriov_vf_device_attr.attr, &sriov_drivers_autoprobe_attr.attr, + &sriov_unmanaged_autoprobe_attr.attr, NULL, }; diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fcd81911b127..b5f8b034f02d 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -272,7 +272,9 @@ struct pci_sriov { struct pci_dev *dev; /* Lowest numbered PF */ struct pci_dev *self; /* This PF */ resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ - bool drivers_autoprobe; /* Auto probing of VFs by driver */ + bool autoprobe; /* Auto probing of VFs by VF driver */ + bool drivers_autoprobe; /* "" managed by PF driver */ + bool unmanaged_autoprobe; /* "" unmanaged by kernel */ }; /* pci_dev priv_flags */ diff --git a/include/linux/pci.h b/include/linux/pci.h index 024a1beda008..0d2b10aeadfa 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1947,6 +1947,7 @@ static inline void pci_mmcfg_late_init(void) { } int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); void pci_disable_sriov(struct pci_dev *dev); +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int num_vfs); int pci_iov_add_virtfn(struct pci_dev *dev, int id); void pci_iov_remove_virtfn(struct pci_dev *dev, int id); int pci_num_vf(struct pci_dev *dev);