Message ID: 20210126085730.1165673-2-leon@kernel.org (mailing list archive)
State:      Awaiting Upstream
Series:     Dynamically assign MSI-X vectors count
On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky <leonro@nvidia.com> > > Extend PCI sysfs interface with a new callback that allows configure > the number of MSI-X vectors for specific SR-IO VF. This is needed > to optimize the performance of newly bound devices by allocating > the number of vectors based on the administrator knowledge of targeted VM. > > This function is applicable for SR-IOV VF because such devices allocate > their MSI-X table before they will run on the VMs and HW can't guess the > right number of vectors, so the HW allocates them statically and equally. > > 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > file will be seen for the VFs and it is writable as long as a driver is not > bounded to the VF. > > The values accepted are: > * > 0 - this will be number reported by the VF's MSI-X capability > * < 0 - not valid > * = 0 - will reset to the device default value > > 2) In order to make management easy, provide new read-only sysfs file that > returns a total number of possible to configure MSI-X vectors. > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > = 0 - feature is not supported > > 0 - total number of MSI-X vectors to consume by the VFs > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> > --- > Documentation/ABI/testing/sysfs-bus-pci | 32 +++++ > drivers/pci/iov.c | 180 ++++++++++++++++++++++++ > drivers/pci/msi.c | 47 +++++++ > drivers/pci/pci.h | 4 + > include/linux/pci.h | 10 ++ > 5 files changed, 273 insertions(+) Bjorn, Can I please get your Acked-by on this so it will go through mlx5-next -> netdev submission flow? Thanks
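For orientation, this is roughly how a PF driver is expected to plug into the new interface: it implements the ->sriov_set_msix_vec_count() callback and, after enabling SR-IOV, reports the vector budget and creates the overlay files. The sketch below is hypothetical (the "foo" driver name, the 2048-vector budget and the firmware step are invented); only the exported helpers and the callback come from this patch.

#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical PF driver showing how the new hooks fit together. */
static int foo_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
{
        /*
         * Device-specific part: tell the device firmware, through the PF,
         * how many MSI-X vectors this not-yet-probed VF should advertise
         * in its MSI-X Message Control register.
         */
        return 0;
}

static int foo_sriov_configure(struct pci_dev *pf, int num_vfs)
{
        int ret;

        if (!num_vfs) {
                pci_disable_vfs_overlay(pf);
                pci_disable_sriov(pf);
                return 0;
        }

        ret = pci_enable_sriov(pf, num_vfs);
        if (ret)
                return ret;

        /* Assumed: firmware reports 2048 vectors shared by all VFs. */
        pci_sriov_set_vf_total_msix(pf, 2048);

        ret = pci_enable_vfs_overlay(pf);
        if (ret) {
                pci_disable_sriov(pf);
                return ret;
        }
        return num_vfs;
}

static struct pci_driver foo_pf_driver = {
        .name                     = "foo",
        .sriov_configure          = foo_sriov_configure,
        .sriov_set_msix_vec_count = foo_sriov_set_msix_vec_count,
        /* .probe, .remove and .id_table omitted from this sketch */
};

With such a driver bound to the PF, a write to a VF's sriov_vf_msix_count file ends up in foo_sriov_set_msix_vec_count() via pci_vf_set_msix_vec_count().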
On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky <leonro@nvidia.com> > > Extend PCI sysfs interface with a new callback that allows configure > the number of MSI-X vectors for specific SR-IO VF. This is needed > to optimize the performance of newly bound devices by allocating > the number of vectors based on the administrator knowledge of targeted VM. s/configure/configuration of/ s/SR-IO/SR-IOV/ s/newly bound/VFs/ ? s/VF/VFs/ s/knowledge of targeted VM/knowledge of the intended use of the VF/ (I'm not a VF expert, but I think they can be used even without VMs) I'm reading between the lines here, but IIUC the point is that you have a PF that supports a finite number of MSI-X vectors for use by all the VFs, and this interface is to control the distribution of those MSI-X vectors among the VFs. > This function is applicable for SR-IOV VF because such devices allocate > their MSI-X table before they will run on the VMs and HW can't guess the > right number of vectors, so the HW allocates them statically and equally. This is written in a way that suggests this is behavior required by the PCIe spec. If it is indeed related to something in the spec, please cite it. But I think this is actually something device-specific, not something we can derive directly from the spec. If that's the case, be clear that we're addressing a device-specific need, and we're hoping that this will be useful for other devices as well. "such devices allocate their MSI-X table before they will run on the VMs": Let's be specific here. This MSI-X Table allocation apparently doesn't happen when we set VF Enable in the PF, because these sysfs files are attached to the VFs, which don't exist yet. It's not the VF driver binding, because that's a software construct. What is the hardware event that triggers the allocation? Obviously the distribution among VFs can be changed after VF Enable is set. Maybe the distribution is dynamic, and the important point is that it must be changed before the VF driver reads the Message Control register for Table Size? But that isn't the same as "devices allocating their MSI-X table before being passed through to a VM," so it's confusing. The language about allocating the MSI-X table needs to be made precise here and in the code comments below. "before they will run on the VMs": Devices don't "run on VMs". I think the usual terminology is that a device may be "passed through to a VM". "HW allocates them statically and equally" sounds like a description of some device-specific behavior (unless there's something in the spec that requires this, in which case you should cite it). It's OK if this is device-specific; just don't pretend that it's generic if it's not. > 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > file will be seen for the VFs and it is writable as long as a driver is not > bounded to the VF. "bound to the VF" > The values accepted are: > * > 0 - this will be number reported by the VF's MSI-X capability Specifically, I guess by Table Size in the VF's MSI-X Message Control register? > * < 0 - not valid > * = 0 - will reset to the device default value > > 2) In order to make management easy, provide new read-only sysfs file that > returns a total number of possible to configure MSI-X vectors. 
> > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > = 0 - feature is not supported > > 0 - total number of MSI-X vectors to consume by the VFs "total number of MSI-X vectors available for distribution among the VFs"? > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> > --- > Documentation/ABI/testing/sysfs-bus-pci | 32 +++++ > drivers/pci/iov.c | 180 ++++++++++++++++++++++++ > drivers/pci/msi.c | 47 +++++++ > drivers/pci/pci.h | 4 + > include/linux/pci.h | 10 ++ > 5 files changed, 273 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci > index 25c9c39770c6..4d206ade5331 100644 > --- a/Documentation/ABI/testing/sysfs-bus-pci > +++ b/Documentation/ABI/testing/sysfs-bus-pci > @@ -375,3 +375,35 @@ Description: > The value comes from the PCI kernel device state and can be one > of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold". > The file is read only. > + > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > +Date: January 2021 > +Contact: Leon Romanovsky <leonro@nvidia.com> > +Description: > + This file is associated with the SR-IOV VFs. > + It allows configuration of the number of MSI-X vectors for > + the VF. This is needed to optimize performance of newly bound > + devices by allocating the number of vectors based on the > + administrator knowledge of targeted VM. > + > + The values accepted are: > + * > 0 - this will be number reported by the VF's MSI-X > + capability > + * < 0 - not valid > + * = 0 - will reset to the device default value > + > + The file is writable if the PF is bound to a driver that > + set sriov_vf_total_msix > 0 and there is no driver bound > + to the VF. > + > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > +Date: January 2021 > +Contact: Leon Romanovsky <leonro@nvidia.com> > +Description: > + This file is associated with the SR-IOV PFs. > + It returns a total number of possible to configure MSI-X > + vectors on the enabled VFs. > + > + The values returned are: > + * > 0 - this will be total number possible to consume by VFs, > + * = 0 - feature is not supported What does "vfs_overlay" mean? "vfs" sounds like the Virtual File System. Do these filenames really need to contain both "sriov" and "vf"? Should these be next to the existing SR-IOV sysfs files, i.e., in or alongside sriov_dev_attr_group? Hmmm, I see pci_enable_vfs_overlay() is called by the driver. I don't really like that because then we're dependent on drivers to maintain the PCI sysfs hierarchy. E.g., a driver might forget to call pci_disable_vfs_overlay(), and then a future driver load will fail. Maybe this could be done with .is_visible() functions that call driver callbacks. 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index 4afd4ee4f7f0..3e95f835eba5 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) > return (dev->devfn + dev->sriov->offset + > dev->sriov->stride * vf_id) & 0xff; > } > +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn); > > /* > * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may > @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev, > return rc; > } > > +#ifdef CONFIG_PCI_MSI > +static ssize_t sriov_vf_msix_count_show(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + int count = pci_msix_vec_count(pdev); > + > + if (count < 0) > + return count; > + > + return sysfs_emit(buf, "%d\n", count); > +} > + > +static ssize_t sriov_vf_msix_count_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct pci_dev *vf_dev = to_pci_dev(dev); > + int val, ret; > + > + ret = kstrtoint(buf, 0, &val); > + if (ret) > + return ret; > + > + ret = pci_vf_set_msix_vec_count(vf_dev, val); > + if (ret) > + return ret; > + > + return count; > +} > +static DEVICE_ATTR_RW(sriov_vf_msix_count); > + > +static ssize_t sriov_vf_total_msix_show(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + > + return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix); > +} > +static DEVICE_ATTR_RO(sriov_vf_total_msix); > +#endif > + > +static struct attribute *sriov_pf_dev_attrs[] = { > +#ifdef CONFIG_PCI_MSI > + &dev_attr_sriov_vf_total_msix.attr, > +#endif > + NULL, > +}; > + > +static struct attribute *sriov_vf_dev_attrs[] = { > +#ifdef CONFIG_PCI_MSI > + &dev_attr_sriov_vf_msix_count.attr, > +#endif > + NULL, > +}; > + > +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj, > + struct attribute *a, int n) > +{ > + struct device *dev = kobj_to_dev(kobj); > + struct pci_dev *pdev = to_pci_dev(dev); > + > + if (!pdev->msix_cap || !dev_is_pf(dev)) > + return 0; > + > + return a->mode; > +} > + > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj, > + struct attribute *a, int n) > +{ > + struct device *dev = kobj_to_dev(kobj); > + struct pci_dev *pdev = to_pci_dev(dev); > + > + if (!pdev->msix_cap || dev_is_pf(dev)) > + return 0; > + > + return a->mode; > +} > + > +static const struct attribute_group sriov_pf_dev_attr_group = { > + .attrs = sriov_pf_dev_attrs, > + .is_visible = sriov_pf_attrs_are_visible, > + .name = "vfs_overlay", > +}; > + > +static const struct attribute_group sriov_vf_dev_attr_group = { > + .attrs = sriov_vf_dev_attrs, > + .is_visible = sriov_vf_attrs_are_visible, > + .name = "vfs_overlay", > +}; > + > +int pci_enable_vfs_overlay(struct pci_dev *dev) > +{ > + struct pci_dev *virtfn; > + int id, ret; > + > + if (!dev->is_physfn || !dev->sriov->num_VFs) > + return 0; > + > + ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > + if (ret) > + return ret; > + > + for (id = 0; id < dev->sriov->num_VFs; id++) { > + virtfn = pci_get_domain_bus_and_slot( > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > + pci_iov_virtfn_devfn(dev, id)); > + > + if (!virtfn) > + continue; > + > + ret = sysfs_create_group(&virtfn->dev.kobj, > + &sriov_vf_dev_attr_group); > + if (ret) > + goto out; > + } > + return 0; > + > +out: > + while (id--) { > + virtfn = pci_get_domain_bus_and_slot( > + pci_domain_nr(dev->bus), 
pci_iov_virtfn_bus(dev, id), > + pci_iov_virtfn_devfn(dev, id)); > + > + if (!virtfn) > + continue; > + > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > + } > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > + return ret; > +} > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay); > + > +void pci_disable_vfs_overlay(struct pci_dev *dev) > +{ > + struct pci_dev *virtfn; > + int id; > + > + if (!dev->is_physfn || !dev->sriov->num_VFs) > + return; > + > + id = dev->sriov->num_VFs; > + while (id--) { > + virtfn = pci_get_domain_bus_and_slot( > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > + pci_iov_virtfn_devfn(dev, id)); > + > + if (!virtfn) > + continue; > + > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > + } > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > +} > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); I'm not convinced all this sysfs wrangling is necessary. If it is, add a hint in a comment about why this is special and can't use something like sriov_dev_attr_group. > int pci_iov_add_virtfn(struct pci_dev *dev, int id) > { > int i; > @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev) > sysfs_remove_link(&dev->dev.kobj, "dep_link"); > > iov->num_VFs = 0; > + iov->vf_total_msix = 0; > pci_iov_set_numvfs(dev, 0); > } > > @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev) > } > EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs); > > +/** > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs > + * @dev: the PCI PF device > + * @count: the total number of MSI-X vector to consume by the VFs > + * > + * Sets the number of MSI-X vectors that is possible to consume by the VFs. > + * This interface is complimentary part of the pci_vf_set_msix_vec_count() s/Sets the/Set the/ s/complimentary part of the/complementary to/ > + * that will be used to configure the required number on the VF. > + */ > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) > +{ > + if (!dev->is_physfn) > + return; > + > + dev->sriov->vf_total_msix = count; The PCI core doesn't use vf_total_msix at all. The driver, e.g., mlx5, calls this, and all the PCI core does is hang onto the value and expose it via sysfs. I think I'd rather have a callback in struct pci_driver and let the driver supply the value when needed. I.e., sriov_vf_total_msix_show() would call the callback instead of looking at pdev->sriov->vf_total_msix. > +} > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix); > + > /** > * pci_sriov_configure_simple - helper to configure SR-IOV > * @dev: the PCI device > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 3162f88fe940..1022fe9e6efd 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev) > } > EXPORT_SYMBOL(pci_msix_vec_count); > > +/** > + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors > + * This function is applicable for SR-IOV VF because such devices allocate > + * their MSI-X table before they will run on the VMs and HW can't guess the > + * right number of vectors, so the HW allocates them statically and equally. 
> + * @dev: VF device that is going to be changed > + * @count: amount of MSI-X vectors s/amount/number/ > + **/ > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count) > +{ > + struct pci_dev *pdev = pci_physfn(dev); > + int ret; > + > + if (count < 0) > + /* > + * We don't support negative numbers for now, > + * but maybe in the future it will make sense. > + */ Drop the comment; I don't think it adds useful information. > + return -EINVAL; > + > + device_lock(&pdev->dev); > + if (!pdev->driver) { > + ret = -EOPNOTSUPP; > + goto err_pdev; > + } > + > + device_lock(&dev->dev); > + if (dev->driver) { > + /* > + * Driver already probed this VF and configured itself > + * based on previously configured (or default) MSI-X vector > + * count. It is too late to change this field for this > + * specific VF. > + */ > + ret = -EBUSY; > + goto err_dev; > + } > + > + ret = pdev->driver->sriov_set_msix_vec_count(dev, count); This looks like a NULL pointer dereference. > +err_dev: > + device_unlock(&dev->dev); > +err_pdev: > + device_unlock(&pdev->dev); > + return ret; > +} > + > static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, > int nvec, struct irq_affinity *affd, int flags) > { > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index 5c59365092fa..2bd6560d91e2 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay; > > #ifdef CONFIG_PCI_MSI > void pci_no_msi(void); > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count); > #else > static inline void pci_no_msi(void) { } > #endif > @@ -326,6 +327,9 @@ struct pci_sriov { > u16 subsystem_device; /* VF subsystem device */ > resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ > bool drivers_autoprobe; /* Auto probing of VFs by driver */ > + u32 vf_total_msix; /* Total number of MSI-X vectors the VFs > + * can consume > + */ * can consume */ Hopefully you can eliminate vf_total_msix altogether. > }; > > /** > diff --git a/include/linux/pci.h b/include/linux/pci.h > index b32126d26997..24d118ad6e7b 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -856,6 +856,8 @@ struct module; > * e.g. drivers/net/e100.c. > * @sriov_configure: Optional driver callback to allow configuration of > * number of VFs to enable via sysfs "sriov_numvfs" file. > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors > + * exposed by the sysfs "vf_msix_vec" entry. "vf_msix_vec" is apparently stale? There's no other reference in this patch. I think the important part is that this changes the number of vectors advertised in the VF's MSI-X Message Control register, which will be read when the VF driver enables MSI-X. If that's true, why would we expose this via a sysfs file? We can easily read it via lspci. > * @err_handler: See Documentation/PCI/pci-error-recovery.rst > * @groups: Sysfs attribute groups. > * @driver: Driver model structure. 
> @@ -871,6 +873,7 @@ struct pci_driver { > int (*resume)(struct pci_dev *dev); /* Device woken up */ > void (*shutdown)(struct pci_dev *dev); > int (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */ > + int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */ > const struct pci_error_handlers *err_handler; > const struct attribute_group **groups; > struct device_driver driver; > @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar); > int pci_iov_virtfn_bus(struct pci_dev *dev, int id); > int pci_iov_virtfn_devfn(struct pci_dev *dev, int id); > > +int pci_enable_vfs_overlay(struct pci_dev *dev); > +void pci_disable_vfs_overlay(struct pci_dev *dev); > + > int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); > void pci_disable_sriov(struct pci_dev *dev); > > @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev); > int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn); > resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); > void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe); > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count); > > /* Arch may override these (weak) */ > int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs); > @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id) > } > static inline void pci_iov_remove_virtfn(struct pci_dev *dev, > int id) { } > +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; } > +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {} s/{}/{ }/ Please make your code match the rest of the file, e.g., the very next line! > static inline void pci_disable_sriov(struct pci_dev *dev) { } > static inline int pci_num_vf(struct pci_dev *dev) { return 0; } > static inline int pci_vfs_assigned(struct pci_dev *dev) > @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev) > static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno) > { return 0; } > static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { } > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {} Also here. Unless removing the space would make this fit in 80 columns. > #endif > > #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE) > -- > 2.29.2 >
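Regarding the possible NULL pointer dereference flagged above: since ->sriov_set_msix_vec_count() is an optional callback, the usual defensive shape (the check Leon agrees to add further down the thread) would be roughly:

        device_lock(&pdev->dev);
        if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
                ret = -EOPNOTSUPP;
                goto err_pdev;
        }

i.e. the existing !pdev->driver test in pci_vf_set_msix_vec_count() extended to also require the callback.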
On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > From: Leon Romanovsky <leonro@nvidia.com> > > > > Extend PCI sysfs interface with a new callback that allows configure > > the number of MSI-X vectors for specific SR-IO VF. This is needed > > to optimize the performance of newly bound devices by allocating > > the number of vectors based on the administrator knowledge of targeted VM. > > s/configure/configuration of/ > s/SR-IO/SR-IOV/ > s/newly bound/VFs/ ? > s/VF/VFs/ > s/knowledge of targeted VM/knowledge of the intended use of the VF/ > (I'm not a VF expert, but I think they can be used even without VMs) > > I'm reading between the lines here, but IIUC the point is that you > have a PF that supports a finite number of MSI-X vectors for use by > all the VFs, and this interface is to control the distribution of > those MSI-X vectors among the VFs. The MSI-X is HW resource, all devices in the world have limitation here. > > > This function is applicable for SR-IOV VF because such devices allocate > > their MSI-X table before they will run on the VMs and HW can't guess the > > right number of vectors, so the HW allocates them statically and equally. > > This is written in a way that suggests this is behavior required by > the PCIe spec. If it is indeed related to something in the spec, > please cite it. Spec doesn't say it directly, but you will need to really hurt brain of your users if you decide to do it differently. You have one enable bit to create all VFs at the same time without any option to configure them in advance. Of course, you can create some partition map, upload it to FW and create from there. > > But I think this is actually something device-specific, not something > we can derive directly from the spec. If that's the case, be clear > that we're addressing a device-specific need, and we're hoping that > this will be useful for other devices as well. I will add. > > "such devices allocate their MSI-X table before they will run on the > VMs": Let's be specific here. This MSI-X Table allocation apparently > doesn't happen when we set VF Enable in the PF, because these sysfs > files are attached to the VFs, which don't exist yet. It's not the VF > driver binding, because that's a software construct. What is the > hardware event that triggers the allocation? Write of MSI-X vector count to the FW through PF. > > Obviously the distribution among VFs can be changed after VF Enable is > set. Maybe the distribution is dynamic, and the important point is > that it must be changed before the VF driver reads the Message Control > register for Table Size? Yes > > But that isn't the same as "devices allocating their MSI-X table > before being passed through to a VM," so it's confusing. The > language about allocating the MSI-X table needs to be made precise > here and in the code comments below. > > "before they will run on the VMs": Devices don't "run on VMs". I > think the usual terminology is that a device may be "passed through to > a VM". > > "HW allocates them statically and equally" sounds like a description > of some device-specific behavior (unless there's something in the spec > that requires this, in which case you should cite it). It's OK if > this is device-specific; just don't pretend that it's generic if it's > not. I will change. 
> > > 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > > file will be seen for the VFs and it is writable as long as a driver is not > > bounded to the VF. > > "bound to the VF" > > > The values accepted are: > > * > 0 - this will be number reported by the VF's MSI-X capability > > Specifically, I guess by Table Size in the VF's MSI-X Message Control > register? Right > > > * < 0 - not valid > > * = 0 - will reset to the device default value > > > > 2) In order to make management easy, provide new read-only sysfs file that > > returns a total number of possible to configure MSI-X vectors. > > > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > = 0 - feature is not supported > > > 0 - total number of MSI-X vectors to consume by the VFs > > "total number of MSI-X vectors available for distribution among the > VFs"? Users need to be aware of how much vectors exist in the system. > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> > > --- > > Documentation/ABI/testing/sysfs-bus-pci | 32 +++++ > > drivers/pci/iov.c | 180 ++++++++++++++++++++++++ > > drivers/pci/msi.c | 47 +++++++ > > drivers/pci/pci.h | 4 + > > include/linux/pci.h | 10 ++ > > 5 files changed, 273 insertions(+) > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci > > index 25c9c39770c6..4d206ade5331 100644 > > --- a/Documentation/ABI/testing/sysfs-bus-pci > > +++ b/Documentation/ABI/testing/sysfs-bus-pci > > @@ -375,3 +375,35 @@ Description: > > The value comes from the PCI kernel device state and can be one > > of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold". > > The file is read only. > > + > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > > +Date: January 2021 > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > +Description: > > + This file is associated with the SR-IOV VFs. > > + It allows configuration of the number of MSI-X vectors for > > + the VF. This is needed to optimize performance of newly bound > > + devices by allocating the number of vectors based on the > > + administrator knowledge of targeted VM. > > + > > + The values accepted are: > > + * > 0 - this will be number reported by the VF's MSI-X > > + capability > > + * < 0 - not valid > > + * = 0 - will reset to the device default value > > + > > + The file is writable if the PF is bound to a driver that > > + set sriov_vf_total_msix > 0 and there is no driver bound > > + to the VF. > > + > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > +Date: January 2021 > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > +Description: > > + This file is associated with the SR-IOV PFs. > > + It returns a total number of possible to configure MSI-X > > + vectors on the enabled VFs. > > + > > + The values returned are: > > + * > 0 - this will be total number possible to consume by VFs, > > + * = 0 - feature is not supported > > What does "vfs_overlay" mean? "vfs" sounds like the Virtual File > System. VFs == many VF? > > Do these filenames really need to contain both "sriov" and "vf"? This is what I was asked at some point. In previous versions the name was without "sriov". > > Should these be next to the existing SR-IOV sysfs files, i.e., in or > alongside sriov_dev_attr_group? This was suggestion in previous versions. I didn't hear anyone supporting it. > > Hmmm, I see pci_enable_vfs_overlay() is called by the driver. 
I don't > really like that because then we're dependent on drivers to maintain > the PCI sysfs hierarchy. E.g., a driver might forget to call > pci_disable_vfs_overlay(), and then a future driver load will fail. It is not different from any other kernel bug. I have gazillion ways to break the system with crappy driver. > > Maybe this could be done with .is_visible() functions that call driver > callbacks. It was in previous versions too, but this solution allows PF to control the VFs overlay dynamically. > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > index 4afd4ee4f7f0..3e95f835eba5 100644 > > --- a/drivers/pci/iov.c > > +++ b/drivers/pci/iov.c > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) > > return (dev->devfn + dev->sriov->offset + > > dev->sriov->stride * vf_id) & 0xff; > > } > > +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn); > > > > /* > > * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may > > @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev, > > return rc; > > } > > > > +#ifdef CONFIG_PCI_MSI > > +static ssize_t sriov_vf_msix_count_show(struct device *dev, > > + struct device_attribute *attr, > > + char *buf) > > +{ > > + struct pci_dev *pdev = to_pci_dev(dev); > > + int count = pci_msix_vec_count(pdev); > > + > > + if (count < 0) > > + return count; > > + > > + return sysfs_emit(buf, "%d\n", count); > > +} > > + > > +static ssize_t sriov_vf_msix_count_store(struct device *dev, > > + struct device_attribute *attr, > > + const char *buf, size_t count) > > +{ > > + struct pci_dev *vf_dev = to_pci_dev(dev); > > + int val, ret; > > + > > + ret = kstrtoint(buf, 0, &val); > > + if (ret) > > + return ret; > > + > > + ret = pci_vf_set_msix_vec_count(vf_dev, val); > > + if (ret) > > + return ret; > > + > > + return count; > > +} > > +static DEVICE_ATTR_RW(sriov_vf_msix_count); > > + > > +static ssize_t sriov_vf_total_msix_show(struct device *dev, > > + struct device_attribute *attr, > > + char *buf) > > +{ > > + struct pci_dev *pdev = to_pci_dev(dev); > > + > > + return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix); > > +} > > +static DEVICE_ATTR_RO(sriov_vf_total_msix); > > +#endif > > + > > +static struct attribute *sriov_pf_dev_attrs[] = { > > +#ifdef CONFIG_PCI_MSI > > + &dev_attr_sriov_vf_total_msix.attr, > > +#endif > > + NULL, > > +}; > > + > > +static struct attribute *sriov_vf_dev_attrs[] = { > > +#ifdef CONFIG_PCI_MSI > > + &dev_attr_sriov_vf_msix_count.attr, > > +#endif > > + NULL, > > +}; > > + > > +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj, > > + struct attribute *a, int n) > > +{ > > + struct device *dev = kobj_to_dev(kobj); > > + struct pci_dev *pdev = to_pci_dev(dev); > > + > > + if (!pdev->msix_cap || !dev_is_pf(dev)) > > + return 0; > > + > > + return a->mode; > > +} > > + > > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj, > > + struct attribute *a, int n) > > +{ > > + struct device *dev = kobj_to_dev(kobj); > > + struct pci_dev *pdev = to_pci_dev(dev); > > + > > + if (!pdev->msix_cap || dev_is_pf(dev)) > > + return 0; > > + > > + return a->mode; > > +} > > + > > +static const struct attribute_group sriov_pf_dev_attr_group = { > > + .attrs = sriov_pf_dev_attrs, > > + .is_visible = sriov_pf_attrs_are_visible, > > + .name = "vfs_overlay", > > +}; > > + > > +static const struct attribute_group sriov_vf_dev_attr_group = { > > + .attrs = sriov_vf_dev_attrs, > > + .is_visible = sriov_vf_attrs_are_visible, > > + .name = "vfs_overlay", > > +}; > > + > > +int 
pci_enable_vfs_overlay(struct pci_dev *dev) > > +{ > > + struct pci_dev *virtfn; > > + int id, ret; > > + > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > + return 0; > > + > > + ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > + if (ret) > > + return ret; > > + > > + for (id = 0; id < dev->sriov->num_VFs; id++) { > > + virtfn = pci_get_domain_bus_and_slot( > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > + pci_iov_virtfn_devfn(dev, id)); > > + > > + if (!virtfn) > > + continue; > > + > > + ret = sysfs_create_group(&virtfn->dev.kobj, > > + &sriov_vf_dev_attr_group); > > + if (ret) > > + goto out; > > + } > > + return 0; > > + > > +out: > > + while (id--) { > > + virtfn = pci_get_domain_bus_and_slot( > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > + pci_iov_virtfn_devfn(dev, id)); > > + > > + if (!virtfn) > > + continue; > > + > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > + } > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > + return ret; > > +} > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay); > > + > > +void pci_disable_vfs_overlay(struct pci_dev *dev) > > +{ > > + struct pci_dev *virtfn; > > + int id; > > + > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > + return; > > + > > + id = dev->sriov->num_VFs; > > + while (id--) { > > + virtfn = pci_get_domain_bus_and_slot( > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > + pci_iov_virtfn_devfn(dev, id)); > > + > > + if (!virtfn) > > + continue; > > + > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > + } > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > +} > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); > > I'm not convinced all this sysfs wrangling is necessary. If it is, > add a hint in a comment about why this is special and can't use > something like sriov_dev_attr_group. This makes the overlay to be PF-driven. Alexander insisted on this flow. > > > int pci_iov_add_virtfn(struct pci_dev *dev, int id) > > { > > int i; > > @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev) > > sysfs_remove_link(&dev->dev.kobj, "dep_link"); > > > > iov->num_VFs = 0; > > + iov->vf_total_msix = 0; > > pci_iov_set_numvfs(dev, 0); > > } > > > > @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev) > > } > > EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs); > > > > +/** > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs > > + * @dev: the PCI PF device > > + * @count: the total number of MSI-X vector to consume by the VFs > > + * > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs. > > + * This interface is complimentary part of the pci_vf_set_msix_vec_count() > > s/Sets the/Set the/ > s/complimentary part of the/complementary to/ > > > + * that will be used to configure the required number on the VF. > > + */ > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) > > +{ > > + if (!dev->is_physfn) > > + return; > > + > > + dev->sriov->vf_total_msix = count; > > The PCI core doesn't use vf_total_msix at all. The driver, e.g., > mlx5, calls this, and all the PCI core does is hang onto the value and > expose it via sysfs. I think I'd rather have a callback in struct > pci_driver and let the driver supply the value when needed. I.e., > sriov_vf_total_msix_show() would call the callback instead of looking > at pdev->sriov->vf_total_msix. 
It will cause to unnecessary locking to ensure that driver doesn't vanish during sysfs read. I can change, but don't think that it is right decision. > > > +} > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix); > > + > > /** > > * pci_sriov_configure_simple - helper to configure SR-IOV > > * @dev: the PCI device > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > index 3162f88fe940..1022fe9e6efd 100644 > > --- a/drivers/pci/msi.c > > +++ b/drivers/pci/msi.c > > @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev) > > } > > EXPORT_SYMBOL(pci_msix_vec_count); > > > > +/** > > + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors > > + * This function is applicable for SR-IOV VF because such devices allocate > > + * their MSI-X table before they will run on the VMs and HW can't guess the > > + * right number of vectors, so the HW allocates them statically and equally. > > + * @dev: VF device that is going to be changed > > + * @count: amount of MSI-X vectors > > s/amount/number/ > > > + **/ > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count) > > +{ > > + struct pci_dev *pdev = pci_physfn(dev); > > + int ret; > > + > > + if (count < 0) > > + /* > > + * We don't support negative numbers for now, > > + * but maybe in the future it will make sense. > > + */ > > Drop the comment; I don't think it adds useful information. > > > + return -EINVAL; > > + > > + device_lock(&pdev->dev); > > + if (!pdev->driver) { > > + ret = -EOPNOTSUPP; > > + goto err_pdev; > > + } > > + > > + device_lock(&dev->dev); > > + if (dev->driver) { > > + /* > > + * Driver already probed this VF and configured itself > > + * based on previously configured (or default) MSI-X vector > > + * count. It is too late to change this field for this > > + * specific VF. > > + */ > > + ret = -EBUSY; > > + goto err_dev; > > + } > > + > > + ret = pdev->driver->sriov_set_msix_vec_count(dev, count); > > This looks like a NULL pointer dereference. In current code, it is impossible, the call to pci_vf_set_msix_vec_count() will be only for devices that supports sriov_vf_msix_count which is possible with ->sriov_set_msix_vec_count() only. But I will add the check to be bullet proof for the future extensions. > > > +err_dev: > > + device_unlock(&dev->dev); > > +err_pdev: > > + device_unlock(&pdev->dev); > > + return ret; > > +} > > + > > static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, > > int nvec, struct irq_affinity *affd, int flags) > > { > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index 5c59365092fa..2bd6560d91e2 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay; > > > > #ifdef CONFIG_PCI_MSI > > void pci_no_msi(void); > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count); > > #else > > static inline void pci_no_msi(void) { } > > #endif > > @@ -326,6 +327,9 @@ struct pci_sriov { > > u16 subsystem_device; /* VF subsystem device */ > > resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ > > bool drivers_autoprobe; /* Auto probing of VFs by driver */ > > + u32 vf_total_msix; /* Total number of MSI-X vectors the VFs > > + * can consume > > + */ > > * can consume */ > > Hopefully you can eliminate vf_total_msix altogether. I think that will be worse from the flow point of view (extra locks) and the memory if you are worried about it. This variable consumes 4 bytes, the pointer to the function will take 8 bytes. 
> > > }; > > > > /** > > diff --git a/include/linux/pci.h b/include/linux/pci.h > > index b32126d26997..24d118ad6e7b 100644 > > --- a/include/linux/pci.h > > +++ b/include/linux/pci.h > > @@ -856,6 +856,8 @@ struct module; > > * e.g. drivers/net/e100.c. > > * @sriov_configure: Optional driver callback to allow configuration of > > * number of VFs to enable via sysfs "sriov_numvfs" file. > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors > > + * exposed by the sysfs "vf_msix_vec" entry. > > "vf_msix_vec" is apparently stale? There's no other reference in this > patch. > > I think the important part is that this changes the number of vectors > advertised in the VF's MSI-X Message Control register, which will be > read when the VF driver enables MSI-X. > > If that's true, why would we expose this via a sysfs file? We can > easily read it via lspci. I did it for feature complete, we don't need read of this sysfs. > > > * @err_handler: See Documentation/PCI/pci-error-recovery.rst > > * @groups: Sysfs attribute groups. > > * @driver: Driver model structure. > > @@ -871,6 +873,7 @@ struct pci_driver { > > int (*resume)(struct pci_dev *dev); /* Device woken up */ > > void (*shutdown)(struct pci_dev *dev); > > int (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */ > > + int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */ > > const struct pci_error_handlers *err_handler; > > const struct attribute_group **groups; > > struct device_driver driver; > > @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar); > > int pci_iov_virtfn_bus(struct pci_dev *dev, int id); > > int pci_iov_virtfn_devfn(struct pci_dev *dev, int id); > > > > +int pci_enable_vfs_overlay(struct pci_dev *dev); > > +void pci_disable_vfs_overlay(struct pci_dev *dev); > > + > > int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); > > void pci_disable_sriov(struct pci_dev *dev); > > > > @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev); > > int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn); > > resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); > > void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe); > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count); > > > > /* Arch may override these (weak) */ > > int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs); > > @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id) > > } > > static inline void pci_iov_remove_virtfn(struct pci_dev *dev, > > int id) { } > > +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; } > > +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {} > > s/{}/{ }/ > Please make your code match the rest of the file, e.g., the very next line! > > > static inline void pci_disable_sriov(struct pci_dev *dev) { } > > static inline int pci_num_vf(struct pci_dev *dev) { return 0; } > > static inline int pci_vfs_assigned(struct pci_dev *dev) > > @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev) > > static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno) > > { return 0; } > > static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { } > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {} > > Also here. Unless removing the space would make this fit in 80 > columns. 
Yes, it is 80 columns without space. Thanks > > > #endif > > > > #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE) > > -- > > 2.29.2 > >
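For reference, the "Table Size" Bjorn and Leon agree on above is the read-only field in the VF's MSI-X Message Control register; pci_msix_vec_count(), which sriov_vf_msix_count_show() calls, reads it back roughly as follows (a simplified sketch of the existing helper):

        u16 control;

        if (!dev->msix_cap)
                return -EINVAL;

        pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
        return (control & PCI_MSIX_FLAGS_QSIZE) + 1;

So a count written through the new sysfs file only takes effect if the device firmware updates this register before the VF driver enables MSI-X, matching the "must be changed before the VF driver reads Message Control" point above.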
On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > Extend PCI sysfs interface with a new callback that allows > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > This is needed to optimize the performance of newly bound > > > devices by allocating the number of vectors based on the > > > administrator knowledge of targeted VM. > > > > I'm reading between the lines here, but IIUC the point is that you > > have a PF that supports a finite number of MSI-X vectors for use > > by all the VFs, and this interface is to control the distribution > > of those MSI-X vectors among the VFs. > > The MSI-X is HW resource, all devices in the world have limitation > here. > > > > This function is applicable for SR-IOV VF because such devices > > > allocate their MSI-X table before they will run on the VMs and > > > HW can't guess the right number of vectors, so the HW allocates > > > them statically and equally. > > > > This is written in a way that suggests this is behavior required > > by the PCIe spec. If it is indeed related to something in the > > spec, please cite it. > > Spec doesn't say it directly, but you will need to really hurt brain > of your users if you decide to do it differently. You have one > enable bit to create all VFs at the same time without any option to > configure them in advance. > > Of course, you can create some partition map, upload it to FW and > create from there. Of course all devices have limitations. But let's add some details about the way *this* device works. That will help avoid giving the impression that this is the *only* way spec-conforming devices can work. > > "such devices allocate their MSI-X table before they will run on > > the VMs": Let's be specific here. This MSI-X Table allocation > > apparently doesn't happen when we set VF Enable in the PF, because > > these sysfs files are attached to the VFs, which don't exist yet. > > It's not the VF driver binding, because that's a software > > construct. What is the hardware event that triggers the > > allocation? > > Write of MSI-X vector count to the FW through PF. This is an example of something that is obviously specific to this mlx5 device. The Table Size field in Message Control is RO per spec, and obviously firmware on the device is completely outside the scope of the PCIe spec. This commit log should describe how *this* device manages this allocation and how the PF Table Size and the VF Table Sizes are related. Per PCIe, there is no necessary connection between them. > > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > = 0 - feature is not supported > > > > 0 - total number of MSI-X vectors to consume by the VFs > > > > "total number of MSI-X vectors available for distribution among the > > VFs"? > > Users need to be aware of how much vectors exist in the system. Understood -- if there's an interface to influence the distribution of vectors among VFs, one needs to know how many vectors there are to work with. My point was that "number of vectors to consume by VFs" is awkward wording, so I suggested an alternative. > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > > > +Date: January 2021 > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > +Description: > > > + This file is associated with the SR-IOV VFs. 
> > > + It allows configuration of the number of MSI-X vectors for > > > + the VF. This is needed to optimize performance of newly bound > > > + devices by allocating the number of vectors based on the > > > + administrator knowledge of targeted VM. > > > + > > > + The values accepted are: > > > + * > 0 - this will be number reported by the VF's MSI-X > > > + capability > > > + * < 0 - not valid > > > + * = 0 - will reset to the device default value > > > + > > > + The file is writable if the PF is bound to a driver that > > > + set sriov_vf_total_msix > 0 and there is no driver bound > > > + to the VF. Drivers don't actually set "sriov_vf_total_msix". This should probably say something like "the PF is bound to a driver that implements ->sriov_set_msix_vec_count()." > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > +Date: January 2021 > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > +Description: > > > + This file is associated with the SR-IOV PFs. > > > + It returns a total number of possible to configure MSI-X > > > + vectors on the enabled VFs. > > > + > > > + The values returned are: > > > + * > 0 - this will be total number possible to consume by VFs, > > > + * = 0 - feature is not supported Can you swap the order of these two files in the documentation? sriov_vf_total_msix is associated with the PF and logically precedes sriov_vf_msix_count, which is associated with VFs. Not sure the "= 0" description is necessary here. If the value returned is the number of MSI-X vectors available for assignment to VFs, "0" is a perfectly legitimate value. It just means there are none. It doesn't need to be described separately. > > Do these filenames really need to contain both "sriov" and "vf"? > > This is what I was asked at some point. In previous versions the name > was without "sriov". We currently have: $ grep DEVICE_ATTR drivers/pci/iov.c static DEVICE_ATTR_RO(sriov_totalvfs); static DEVICE_ATTR_RW(sriov_numvfs); static DEVICE_ATTR_RO(sriov_offset); static DEVICE_ATTR_RO(sriov_stride); static DEVICE_ATTR_RO(sriov_vf_device); static DEVICE_ATTR_RW(sriov_drivers_autoprobe); If we put them in a new "vfs_overlay" directory, it seems like overkill to repeat the "vf" part, but I'm hoping the new files can end up next to these existing files. In that case, I think it makes sense to include "sriov". And it probably does make sense to include "vf" as well. > > Should these be next to the existing SR-IOV sysfs files, i.e., in or > > alongside sriov_dev_attr_group? > > This was suggestion in previous versions. I didn't hear anyone > supporting it. Pointer to this discussion? I'm not sure what value is added by a new directory. > > Hmmm, I see pci_enable_vfs_overlay() is called by the driver. I don't > > really like that because then we're dependent on drivers to maintain > > the PCI sysfs hierarchy. E.g., a driver might forget to call > > pci_disable_vfs_overlay(), and then a future driver load will fail. > > > > Maybe this could be done with .is_visible() functions that call driver > > callbacks. > > It was in previous versions too, but this solution allows PF to control > the VFs overlay dynamically. See below; I think it might be possible to do this dynamically even without pci_enable_vfs_overlay(). 
> > > +int pci_enable_vfs_overlay(struct pci_dev *dev) > > > +{ > > > + struct pci_dev *virtfn; > > > + int id, ret; > > > + > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > + return 0; > > > + > > > + ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > + if (ret) > > > + return ret; > > > + > > > + for (id = 0; id < dev->sriov->num_VFs; id++) { > > > + virtfn = pci_get_domain_bus_and_slot( > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > + pci_iov_virtfn_devfn(dev, id)); > > > + > > > + if (!virtfn) > > > + continue; > > > + > > > + ret = sysfs_create_group(&virtfn->dev.kobj, > > > + &sriov_vf_dev_attr_group); > > > + if (ret) > > > + goto out; > > > + } > > > + return 0; > > > + > > > +out: > > > + while (id--) { > > > + virtfn = pci_get_domain_bus_and_slot( > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > + pci_iov_virtfn_devfn(dev, id)); > > > + > > > + if (!virtfn) > > > + continue; > > > + > > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > > + } > > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > + return ret; > > > +} > > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay); > > > + > > > +void pci_disable_vfs_overlay(struct pci_dev *dev) > > > +{ > > > + struct pci_dev *virtfn; > > > + int id; > > > + > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > + return; > > > + > > > + id = dev->sriov->num_VFs; > > > + while (id--) { > > > + virtfn = pci_get_domain_bus_and_slot( > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > + pci_iov_virtfn_devfn(dev, id)); > > > + > > > + if (!virtfn) > > > + continue; > > > + > > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > > + } > > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > +} > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); > > > > I'm not convinced all this sysfs wrangling is necessary. If it is, > > add a hint in a comment about why this is special and can't use > > something like sriov_dev_attr_group. > > This makes the overlay to be PF-driven. Alexander insisted on this flow. If you're referring to [1], I think "insisted on this flow" might be an overstatement of what Alex was looking for. IIUC Alex wants the sysfs files to be visible only when they're useful, i.e., when a driver implements ->sriov_set_msix_vec_count(). That seems reasonable and also seems like something a smarter .is_visible() function could accomplish without having drivers call pci_enable_vfs_overlay(), e.g., maybe some variation of this: static umode_t sriov_vf_attrs_are_visible(...) { if (!pdev->msix_cap || dev_is_pf(dev)) return 0; pf = pci_physfn(pdev); if (pf->driver && pf->driver->sriov_set_msix_vec_count) return a->mode; return 0; } (Probably needs locking while we look at pf->driver, just like in pci_vf_set_msix_vec_count().) pci_enable_vfs_overlay() significantly complicates the code and it's the sort of sysfs code we're trying to avoid, so if we really do need it, please add a brief comment about *why* we have to do it that way. > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) > > > +{ > > > + if (!dev->is_physfn) > > > + return; > > > + > > > + dev->sriov->vf_total_msix = count; > > > > The PCI core doesn't use vf_total_msix at all. The driver, e.g., > > mlx5, calls this, and all the PCI core does is hang onto the value and > > expose it via sysfs. 
I think I'd rather have a callback in struct > > pci_driver and let the driver supply the value when needed. I.e., > > sriov_vf_total_msix_show() would call the callback instead of looking > > at pdev->sriov->vf_total_msix. > > It will cause to unnecessary locking to ensure that driver doesn't > vanish during sysfs read. I can change, but don't think that it is right > decision. Doesn't sysfs already ensure the driver can't vanish while we're executing a DEVICE_ATTR accessor? > > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count) > > > +{ > > > + struct pci_dev *pdev = pci_physfn(dev); > > > + int ret; > > > + > > > + if (count < 0) > > > + /* > > > + * We don't support negative numbers for now, > > > + * but maybe in the future it will make sense. > > > + */ > > > > Drop the comment; I don't think it adds useful information. > > > > > + return -EINVAL; > > > + > > > + device_lock(&pdev->dev); > > > + if (!pdev->driver) { > > > + ret = -EOPNOTSUPP; > > > + goto err_pdev; > > > + } > > > + > > > + device_lock(&dev->dev); > > > + if (dev->driver) { > > > + /* > > > + * Driver already probed this VF and configured itself > > > + * based on previously configured (or default) MSI-X vector > > > + * count. It is too late to change this field for this > > > + * specific VF. > > > + */ > > > + ret = -EBUSY; > > > + goto err_dev; > > > + } > > > + > > > + ret = pdev->driver->sriov_set_msix_vec_count(dev, count); > > > > This looks like a NULL pointer dereference. > > In current code, it is impossible, the call to pci_vf_set_msix_vec_count() > will be only for devices that supports sriov_vf_msix_count which is > possible with ->sriov_set_msix_vec_count() only. OK. I think you're right, but it takes quite a lot of analysis to prove that right now. If the .is_visible() function for sriov_vf_msix_count checked whether the driver implements ->sriov_set_msix_vec_count(), it would be quite a bit easier, and it might even help get rid of pci_enable_vfs_overlay(). Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's no actual *reason* for it to be there other than the fact that it has "msix" in the name. It uses no MSI data structures. Maybe it could be folded into sriov_vf_msix_count_store(), which would make the analysis even easier. > > > @@ -326,6 +327,9 @@ struct pci_sriov { > > > u16 subsystem_device; /* VF subsystem device */ > > > resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ > > > bool drivers_autoprobe; /* Auto probing of VFs by driver */ > > > + u32 vf_total_msix; /* Total number of MSI-X vectors the VFs > > > + * can consume > > > + */ > > > > * can consume */ > > > > Hopefully you can eliminate vf_total_msix altogether. > > I think that will be worse from the flow point of view (extra locks) > and the memory if you are worried about it. This variable consumes 4 > bytes, the pointer to the function will take 8 bytes. I'm not concerned about the space. The function pointer is in struct pci_driver, whereas the variable is in struct pci_sriov, so the variable would likely consume more space because there are probably more VFs than pci_drivers. My bigger concern is that vf_total_msix is really *driver* state, not PCI core state, and I'd prefer if the PCI core were not responsible for it. > > > +++ b/include/linux/pci.h > > > @@ -856,6 +856,8 @@ struct module; > > > * e.g. drivers/net/e100.c. > > > * @sriov_configure: Optional driver callback to allow configuration of > > > * number of VFs to enable via sysfs "sriov_numvfs" file. 
> > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors > > > + * exposed by the sysfs "vf_msix_vec" entry. > > > > "vf_msix_vec" is apparently stale? There's no other reference in this > > patch. > > > > I think the important part is that this changes the number of vectors > > advertised in the VF's MSI-X Message Control register, which will be > > read when the VF driver enables MSI-X. > > > > If that's true, why would we expose this via a sysfs file? We can > > easily read it via lspci. > > I did it for feature complete, we don't need read of this sysfs. If you don't need this sysfs file, just remove it. If you do need it, please update the "vf_msix_vec" so it's correct. Also, clarify the description so we can tell that we're changing the Table Size VFs will see in their Message Control registers. > > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {} > > > > Also here. Unless removing the space would make this fit in 80 > > columns. > > Yes, it is 80 columns without space. No, I don't think it is. In an 80 column window, the result looks like: static inline void pci_sriov_set_vf_total_msix(...) { } Please change it so it looks like this so it matches the rest of the file: static inline void pci_sriov_set_vf_total_msix(...) { } Bjorn [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com
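One possible shape of the callback-based alternative Bjorn describes, purely for illustration (the hook name is invented; nothing below is in the posted patch): struct pci_driver would grow something like u32 (*sriov_get_vf_total_msix)(struct pci_dev *pf), and the show function would ask the bound PF driver directly:

        static ssize_t sriov_vf_total_msix_show(struct device *dev,
                                                struct device_attribute *attr,
                                                char *buf)
        {
                struct pci_dev *pdev = to_pci_dev(dev);
                u32 total = 0;

                /* device_lock addresses the "driver may vanish" concern. */
                device_lock(dev);
                if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
                        total = pdev->driver->sriov_get_vf_total_msix(pdev);
                device_unlock(dev);

                return sysfs_emit(buf, "%u\n", total);
        }

This keeps the budget as driver state and would remove the need for pci_sriov_set_vf_total_msix() and the vf_total_msix field.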
On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > This is needed to optimize the performance of newly bound > > > > devices by allocating the number of vectors based on the > > > > administrator knowledge of targeted VM. > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > have a PF that supports a finite number of MSI-X vectors for use > > > by all the VFs, and this interface is to control the distribution > > > of those MSI-X vectors among the VFs. > > > > The MSI-X is HW resource, all devices in the world have limitation > > here. > > > > > > This function is applicable for SR-IOV VF because such devices > > > > allocate their MSI-X table before they will run on the VMs and > > > > HW can't guess the right number of vectors, so the HW allocates > > > > them statically and equally. > > > > > > This is written in a way that suggests this is behavior required > > > by the PCIe spec. If it is indeed related to something in the > > > spec, please cite it. > > > > Spec doesn't say it directly, but you will need to really hurt brain > > of your users if you decide to do it differently. You have one > > enable bit to create all VFs at the same time without any option to > > configure them in advance. > > > > Of course, you can create some partition map, upload it to FW and > > create from there. > > Of course all devices have limitations. But let's add some details > about the way *this* device works. That will help avoid giving the > impression that this is the *only* way spec-conforming devices can > work. Sure > > > > "such devices allocate their MSI-X table before they will run on > > > the VMs": Let's be specific here. This MSI-X Table allocation > > > apparently doesn't happen when we set VF Enable in the PF, because > > > these sysfs files are attached to the VFs, which don't exist yet. > > > It's not the VF driver binding, because that's a software > > > construct. What is the hardware event that triggers the > > > allocation? > > > > Write of MSI-X vector count to the FW through PF. > > This is an example of something that is obviously specific to this > mlx5 device. The Table Size field in Message Control is RO per spec, > and obviously firmware on the device is completely outside the scope > of the PCIe spec. > > This commit log should describe how *this* device manages this > allocation and how the PF Table Size and the VF Table Sizes are > related. Per PCIe, there is no necessary connection between them. There is no connection in mlx5 devices either. PF is used as a vehicle to access VF that doesn't have driver yet. From "table size" perspective they completely independent, because PF already probed by driver and it is already too late to change it. So PF table size is static and can be changed through FW utility only. > > > > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > = 0 - feature is not supported > > > > > 0 - total number of MSI-X vectors to consume by the VFs > > > > > > "total number of MSI-X vectors available for distribution among the > > > VFs"? 
> > > > Users need to be aware of how much vectors exist in the system. > > Understood -- if there's an interface to influence the distribution of > vectors among VFs, one needs to know how many vectors there are to > work with. > > My point was that "number of vectors to consume by VFs" is awkward > wording, so I suggested an alternative. Got it, thanks > > > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count > > > > +Date: January 2021 > > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > > +Description: > > > > + This file is associated with the SR-IOV VFs. > > > > + It allows configuration of the number of MSI-X vectors for > > > > + the VF. This is needed to optimize performance of newly bound > > > > + devices by allocating the number of vectors based on the > > > > + administrator knowledge of targeted VM. > > > > + > > > > + The values accepted are: > > > > + * > 0 - this will be number reported by the VF's MSI-X > > > > + capability > > > > + * < 0 - not valid > > > > + * = 0 - will reset to the device default value > > > > + > > > > + The file is writable if the PF is bound to a driver that > > > > + set sriov_vf_total_msix > 0 and there is no driver bound > > > > + to the VF. > > Drivers don't actually set "sriov_vf_total_msix". This should > probably say something like "the PF is bound to a driver that > implements ->sriov_set_msix_vec_count()." Sure, will change. > > > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > +Date: January 2021 > > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > > +Description: > > > > + This file is associated with the SR-IOV PFs. > > > > + It returns a total number of possible to configure MSI-X > > > > + vectors on the enabled VFs. > > > > + > > > > + The values returned are: > > > > + * > 0 - this will be total number possible to consume by VFs, > > > > + * = 0 - feature is not supported > > Can you swap the order of these two files in the documentation? > sriov_vf_total_msix is associated with the PF and logically precedes > sriov_vf_msix_count, which is associated with VFs. I'll do. > > Not sure the "= 0" description is necessary here. If the value > returned is the number of MSI-X vectors available for assignment to > VFs, "0" is a perfectly legitimate value. It just means there are > none. It doesn't need to be described separately. I wanted to help users and remove ambiguity. For example, mlx5 drivers will always implement proper .set_...() callbacks but for some devices without needed FW support, the value will be 0. Instead of misleading users with wrong promise that feature supported but doesn't have available vectors, I decided to be more clear. For the users, 0 means, don't try, it is not working. > > > > Do these filenames really need to contain both "sriov" and "vf"? > > > > This is what I was asked at some point. In previous versions the name > > was without "sriov". > > We currently have: > > $ grep DEVICE_ATTR drivers/pci/iov.c > static DEVICE_ATTR_RO(sriov_totalvfs); > static DEVICE_ATTR_RW(sriov_numvfs); > static DEVICE_ATTR_RO(sriov_offset); > static DEVICE_ATTR_RO(sriov_stride); > static DEVICE_ATTR_RO(sriov_vf_device); > static DEVICE_ATTR_RW(sriov_drivers_autoprobe); > > If we put them in a new "vfs_overlay" directory, it seems like > overkill to repeat the "vf" part, but I'm hoping the new files can end > up next to these existing files. In that case, I think it makes sense > to include "sriov". And it probably does make sense to include "vf" > as well. 
I put everything in folder to group any possible future extensions. Those extensions are applicable to SR-IOV VFs only, IMHO they deserve separate folder. This is the link with the request to change the name: https://lore.kernel.org/linux-pci/CAKgT0UcJrSNMPAOoniRSnUn+wyRUkL62AfgR3-8QbAkak=pQ=w@mail.gmail.com/ Also, _vf_ in the sriov_vf_total_msix symbolizes that we are talking explicitly about VFs and not the whole device/PF. > > > > Should these be next to the existing SR-IOV sysfs files, i.e., in or > > > alongside sriov_dev_attr_group? > > > > This was suggestion in previous versions. I didn't hear anyone > > supporting it. > > Pointer to this discussion? I'm not sure what value is added by a new > directory. In the link below, Alex talks about creating PF-driven sysfs entries that are clearly visible to users as something other than PCI config space writes. The overlay folder achieves that. https://lore.kernel.org/linux-pci/CAKgT0UeYb5xz8iehE1Y0s-cyFbsy46bjF83BkA7qWZMkAOLR-g@mail.gmail.com/ and this is the resolution that the proposed structure is OK: https://lore.kernel.org/linux-pci/20210124190032.GD5038@unreal/ > > > > Hmmm, I see pci_enable_vfs_overlay() is called by the driver. I don't > > > really like that because then we're dependent on drivers to maintain > > > the PCI sysfs hierarchy. E.g., a driver might forget to call > > > pci_disable_vfs_overlay(), and then a future driver load will fail. > > > > > > Maybe this could be done with .is_visible() functions that call driver > > > callbacks. > > > > It was in previous versions too, but this solution allows PF to control > > the VFs overlay dynamically. > > See below; I think it might be possible to do this dynamically even > without pci_enable_vfs_overlay(). The request was to have this overlay be PF-driven. It means that if the PF driver is removed, the folder should disappear. I tried to do it with .is_visible() and found myself adding hooks in the general PCI remove flow. It was far from clean. 
> > > > > +int pci_enable_vfs_overlay(struct pci_dev *dev) > > > > +{ > > > > + struct pci_dev *virtfn; > > > > + int id, ret; > > > > + > > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > > + return 0; > > > > + > > > > + ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > > + if (ret) > > > > + return ret; > > > > + > > > > + for (id = 0; id < dev->sriov->num_VFs; id++) { > > > > + virtfn = pci_get_domain_bus_and_slot( > > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > > + pci_iov_virtfn_devfn(dev, id)); > > > > + > > > > + if (!virtfn) > > > > + continue; > > > > + > > > > + ret = sysfs_create_group(&virtfn->dev.kobj, > > > > + &sriov_vf_dev_attr_group); > > > > + if (ret) > > > > + goto out; > > > > + } > > > > + return 0; > > > > + > > > > +out: > > > > + while (id--) { > > > > + virtfn = pci_get_domain_bus_and_slot( > > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > > + pci_iov_virtfn_devfn(dev, id)); > > > > + > > > > + if (!virtfn) > > > > + continue; > > > > + > > > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > > > + } > > > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > > + return ret; > > > > +} > > > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay); > > > > + > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev) > > > > +{ > > > > + struct pci_dev *virtfn; > > > > + int id; > > > > + > > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > > + return; > > > > + > > > > + id = dev->sriov->num_VFs; > > > > + while (id--) { > > > > + virtfn = pci_get_domain_bus_and_slot( > > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > > + pci_iov_virtfn_devfn(dev, id)); > > > > + > > > > + if (!virtfn) > > > > + continue; > > > > + > > > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > > > + } > > > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > > +} > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); > > > > > > I'm not convinced all this sysfs wrangling is necessary. If it is, > > > add a hint in a comment about why this is special and can't use > > > something like sriov_dev_attr_group. > > > > This makes the overlay to be PF-driven. Alexander insisted on this flow. > > If you're referring to [1], I think "insisted on this flow" might be > an overstatement of what Alex was looking for. IIUC Alex wants the > sysfs files to be visible only when they're useful, i.e., when a > driver implements ->sriov_set_msix_vec_count(). It is only one side of the request, the sysfs files shouldn't be visible if PF driver was removed and visible again when its probed again. > > That seems reasonable and also seems like something a smarter > .is_visible() function could accomplish without having drivers call > pci_enable_vfs_overlay(), e.g., maybe some variation of this: > > static umode_t sriov_vf_attrs_are_visible(...) > { > if (!pdev->msix_cap || dev_is_pf(dev)) > return 0; > > pf = pci_physfn(pdev); > if (pf->driver && pf->driver->sriov_set_msix_vec_count) > return a->mode; > > return 0; > } It doesn't work with the following flow: 1. load driver 2. disable autoprobe 3. echo to sriov_numvfs .... <--- you have this sriov_vf_attrs_are_visible() created 4. unload driver .... <--- sysfs still exists despite not having PF driver. > > (Probably needs locking while we look at pf->driver, just like in > pci_vf_set_msix_vec_count().) 
> > pci_enable_vfs_overlay() significantly complicates the code and it's > the sort of sysfs code we're trying to avoid, so if we really do need > it, please add a brief comment about *why* we have to do it that way. I will add. > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) > > > > +{ > > > > + if (!dev->is_physfn) > > > > + return; > > > > + > > > > + dev->sriov->vf_total_msix = count; > > > > > > The PCI core doesn't use vf_total_msix at all. The driver, e.g., > > > mlx5, calls this, and all the PCI core does is hang onto the value and > > > expose it via sysfs. I think I'd rather have a callback in struct > > > pci_driver and let the driver supply the value when needed. I.e., > > > sriov_vf_total_msix_show() would call the callback instead of looking > > > at pdev->sriov->vf_total_msix. > > > > It will cause to unnecessary locking to ensure that driver doesn't > > vanish during sysfs read. I can change, but don't think that it is right > > decision. > > Doesn't sysfs already ensure the driver can't vanish while we're > executing a DEVICE_ATTR accessor? It is not, you can see it by adding device_lock_held() check in any .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't do much. This is why pci_vf_set_msix_vec_count() has double lock. > > > > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count) > > > > +{ > > > > + struct pci_dev *pdev = pci_physfn(dev); > > > > + int ret; > > > > + > > > > + if (count < 0) > > > > + /* > > > > + * We don't support negative numbers for now, > > > > + * but maybe in the future it will make sense. > > > > + */ > > > > > > Drop the comment; I don't think it adds useful information. > > > > > > > + return -EINVAL; > > > > + > > > > + device_lock(&pdev->dev); > > > > + if (!pdev->driver) { > > > > + ret = -EOPNOTSUPP; > > > > + goto err_pdev; > > > > + } > > > > + > > > > + device_lock(&dev->dev); > > > > + if (dev->driver) { > > > > + /* > > > > + * Driver already probed this VF and configured itself > > > > + * based on previously configured (or default) MSI-X vector > > > > + * count. It is too late to change this field for this > > > > + * specific VF. > > > > + */ > > > > + ret = -EBUSY; > > > > + goto err_dev; > > > > + } > > > > + > > > > + ret = pdev->driver->sriov_set_msix_vec_count(dev, count); > > > > > > This looks like a NULL pointer dereference. > > > > In current code, it is impossible, the call to pci_vf_set_msix_vec_count() > > will be only for devices that supports sriov_vf_msix_count which is > > possible with ->sriov_set_msix_vec_count() only. > > OK. I think you're right, but it takes quite a lot of analysis to > prove that right now. If the .is_visible() function for > sriov_vf_msix_count checked whether the driver implements > ->sriov_set_msix_vec_count(), it would be quite a bit easier, > and it might even help get rid of pci_enable_vfs_overlay(). > > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's > no actual *reason* for it to be there other than the fact that it has > "msix" in the name. It uses no MSI data structures. Maybe it could > be folded into sriov_vf_msix_count_store(), which would make the > analysis even easier. 
I put _set_ command near _get_ command, but I can move it to iov.c > > > > > @@ -326,6 +327,9 @@ struct pci_sriov { > > > > u16 subsystem_device; /* VF subsystem device */ > > > > resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ > > > > bool drivers_autoprobe; /* Auto probing of VFs by driver */ > > > > + u32 vf_total_msix; /* Total number of MSI-X vectors the VFs > > > > + * can consume > > > > + */ > > > > > > * can consume */ > > > > > > Hopefully you can eliminate vf_total_msix altogether. > > > > I think that will be worse from the flow point of view (extra locks) > > and the memory if you are worried about it. This variable consumes 4 > > bytes, the pointer to the function will take 8 bytes. > > I'm not concerned about the space. The function pointer is in struct > pci_driver, whereas the variable is in struct pci_sriov, so the > variable would likely consume more space because there are probably > more VFs than pci_drivers. > > My bigger concern is that vf_total_msix is really *driver* state, not > PCI core state, and I'd prefer if the PCI core were not responsible > for it. As i said above, I can do it, just don't think that it is right. Should I do it or not? > > > > > +++ b/include/linux/pci.h > > > > @@ -856,6 +856,8 @@ struct module; > > > > * e.g. drivers/net/e100.c. > > > > * @sriov_configure: Optional driver callback to allow configuration of > > > > * number of VFs to enable via sysfs "sriov_numvfs" file. > > > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors > > > > + * exposed by the sysfs "vf_msix_vec" entry. > > > > > > "vf_msix_vec" is apparently stale? There's no other reference in this > > > patch. > > > > > > I think the important part is that this changes the number of vectors > > > advertised in the VF's MSI-X Message Control register, which will be > > > read when the VF driver enables MSI-X. > > > > > > If that's true, why would we expose this via a sysfs file? We can > > > easily read it via lspci. > > > > I did it for feature complete, we don't need read of this sysfs. > > If you don't need this sysfs file, just remove it. If you do need it, > please update the "vf_msix_vec" so it's correct. Also, clarify the > description so we can tell that we're changing the Table Size VFs will > see in their Message Control registers. I'll do > > > > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {} > > > > > > Also here. Unless removing the space would make this fit in 80 > > > columns. > > > > Yes, it is 80 columns without space. > > No, I don't think it is. In an 80 column window, the result looks > like: > > static inline void pci_sriov_set_vf_total_msix(...) { > } > > Please change it so it looks like this so it matches the rest of the > file: > > static inline void pci_sriov_set_vf_total_msix(...) > { } I counted back then, but sure will check again. Thanks for your review. > > Bjorn > > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com
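The locking point can be seen directly in the core code. A slightly simplified copy of dev_attr_show() from drivers/base/core.c of this era: the ->show() callback is invoked without the device lock being held, which is why any attribute that dereferences pdev->driver has to take the lock itself.

static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
			     char *buf)
{
	struct device_attribute *dev_attr = to_dev_attr(attr);
	struct device *dev = kobj_to_dev(kobj);
	ssize_t ret = -EIO;

	if (dev_attr->show)
		ret = dev_attr->show(dev, dev_attr, buf);  /* no device_lock() here */
	/* (overflow warning elided) */
	return ret;
}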
On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote: > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > > This is needed to optimize the performance of newly bound > > > > > devices by allocating the number of vectors based on the > > > > > administrator knowledge of targeted VM. > > > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > > have a PF that supports a finite number of MSI-X vectors for use > > > > by all the VFs, and this interface is to control the distribution > > > > of those MSI-X vectors among the VFs. > > This commit log should describe how *this* device manages this > > allocation and how the PF Table Size and the VF Table Sizes are > > related. Per PCIe, there is no necessary connection between them. > > There is no connection in mlx5 devices either. PF is used as a vehicle > to access VF that doesn't have driver yet. From "table size" perspective > they completely independent, because PF already probed by driver and > it is already too late to change it. > > So PF table size is static and can be changed through FW utility only. This is where description of the device would be useful. The fact that you need "sriov_vf_total_msix" to advertise how many vectors are available and "sriov_vf_msix_count" to influence how they are distributed across the VFs suggests that these Table Sizes are not completely independent. Can a VF have a bigger Table Size than the PF does? Can all the VF Table Sizes added together be bigger than the PF Table Size? If VF A has a larger Table Size, does that mean VF B must have a smaller Table Size? Obviously I do not understand the details about how this device works. It would be helpful to have those details here. Here's the sequence as I understand it: 1) PF driver binds to PF 2) PF driver enables VFs 3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix 4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count 6) VF driver binds to VF 7) VF reads MSI-X Message Control (Table Size) Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read different Table Sizes? That would be a little weird. I'm also a little concerned about doing 2 before 3 & 4. That works for mlx5 because implements the Table Size adjustment in a way that works *after* the VFs have been enabled. But it seems conceivable that a device could implement vector distribution in a way that would require the VF Table Sizes to be fixed *before* enabling VFs. That would be nice in the sense that the VFs would be created "fully formed" and the VF Table Size would be completely read-only as documented. The other knob idea you mentioned at [2]: echo "0000:01:00.2 123" > sriov_vf_msix_count would have the advantage of working for both cases. That's definitely more complicated, but at the same time, I would hate to carve a sysfs interface into stone if it might not work for other devices. 
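The "other knob" mentioned above would live on the PF and take a VF address plus a count in a single write. It was never part of the posted patch; the sketch below is only an illustration of how such a store() could parse its input, and the attribute name, the scan format and the reuse of the patch's pci_vf_set_msix_vec_count() helper are all assumptions.

/* Hypothetical PF-level knob accepting "<domain>:<bus>:<dev>.<fn> <count>" */
static ssize_t sriov_vf_msix_count_store(struct device *dev,
					 struct device_attribute *attr,
					 const char *buf, size_t len)
{
	struct pci_dev *pf = to_pci_dev(dev);
	unsigned int domain, bus, slot, fn;
	struct pci_dev *vf;
	int msix_count, ret;

	if (sscanf(buf, "%x:%x:%x.%x %d", &domain, &bus, &slot, &fn,
		   &msix_count) != 5)
		return -EINVAL;

	vf = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, fn));
	if (!vf)
		return -ENODEV;

	if (!vf->is_virtfn || pci_physfn(vf) != pf) {
		pci_dev_put(vf);
		return -EINVAL;
	}

	ret = pci_vf_set_msix_vec_count(vf, msix_count);  /* helper from the patch */
	pci_dev_put(vf);
	return ret ? ret : len;
}

A drawback Leon points out later in the thread is that management software then has to know the VF's BDF up front, which is what makes the per-VF file in the posted patch simpler to use.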
> > > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > > +Date: January 2021 > > > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > > > +Description: > > > > > + This file is associated with the SR-IOV PFs. > > > > > + It returns a total number of possible to configure MSI-X > > > > > + vectors on the enabled VFs. > > > > > + > > > > > + The values returned are: > > > > > + * > 0 - this will be total number possible to consume by VFs, > > > > > + * = 0 - feature is not supported > > > Not sure the "= 0" description is necessary here. If the value > > returned is the number of MSI-X vectors available for assignment to > > VFs, "0" is a perfectly legitimate value. It just means there are > > none. It doesn't need to be described separately. > > I wanted to help users and remove ambiguity. For example, mlx5 drivers > will always implement proper .set_...() callbacks but for some devices > without needed FW support, the value will be 0. Instead of misleading > users with wrong promise that feature supported but doesn't have > available vectors, I decided to be more clear. For the users, 0 means, don't > try, it is not working. Oh, you mean "feature is not supported by the FIRMWARE on some mlx5 devices"? I totally missed that; I thought you meant "not supported by the PF driver." Why not have the PF driver detect that the firmware doesn't support the feature and just not expose the sysfs files at all in that case? > > If we put them in a new "vfs_overlay" directory, it seems like > > overkill to repeat the "vf" part, but I'm hoping the new files can end > > up next to these existing files. In that case, I think it makes sense > > to include "sriov". And it probably does make sense to include "vf" > > as well. > > I put everything in folder to group any possible future extensions. > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve > separate folder. I'm not convinced (yet) that the possibility of future extensions is enough justification for adding the "vfs_overlay" directory. It really complicates the code flow -- if we skipped the new directory, I'm pretty sure we could make .is_visible() work, which would be a major simplification. And there's quite a bit of value in the new files being right next to the existing sriov_* files. > > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev) > > > > > +{ > > > > > + struct pci_dev *virtfn; > > > > > + int id; > > > > > + > > > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > > > + return; > > > > > + > > > > > + id = dev->sriov->num_VFs; > > > > > + while (id--) { > > > > > + virtfn = pci_get_domain_bus_and_slot( > > > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > > > + pci_iov_virtfn_devfn(dev, id)); > > > > > + > > > > > + if (!virtfn) > > > > > + continue; > > > > > + > > > > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > > > > + } > > > > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > > > +} > > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); > > > > > > > > I'm not convinced all this sysfs wrangling is necessary. If it is, > > > > add a hint in a comment about why this is special and can't use > > > > something like sriov_dev_attr_group. > > > > > > This makes the overlay to be PF-driven. Alexander insisted on this flow. > > > > If you're referring to [1], I think "insisted on this flow" might be > > an overstatement of what Alex was looking for. 
IIUC Alex wants the > > sysfs files to be visible only when they're useful, i.e., when a > > driver implements ->sriov_set_msix_vec_count(). > > It is only one side of the request, the sysfs files shouldn't be visible > if PF driver was removed and visible again when its probed again. I can't parse this, but it's probably related to the question below. > > That seems reasonable and also seems like something a smarter > > .is_visible() function could accomplish without having drivers call > > pci_enable_vfs_overlay(), e.g., maybe some variation of this: > > > > static umode_t sriov_vf_attrs_are_visible(...) > > { > > if (!pdev->msix_cap || dev_is_pf(dev)) > > return 0; > > > > pf = pci_physfn(pdev); > > if (pf->driver && pf->driver->sriov_set_msix_vec_count) > > return a->mode; > > > > return 0; > > } > > It doesn't work with the following flow: > 1. load driver > 2. disable autoprobe > 3. echo to sriov_numvfs > .... <--- you have this sriov_vf_attrs_are_visible() created > 4. unload driver > .... <--- sysfs still exists despite not having PF driver. I missed your point here, sorry. After unloading the PF driver, "pf->driver" in the sketch above will be NULL, so the VF sysfs file would not be visible. Right? Maybe it has to do with autoprobe? I didn't catch what the significance of disabling autoprobe was. > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) > > > > > +{ > > > > > + if (!dev->is_physfn) > > > > > + return; > > > > > + > > > > > + dev->sriov->vf_total_msix = count; > > > > > > > > The PCI core doesn't use vf_total_msix at all. The driver, e.g., > > > > mlx5, calls this, and all the PCI core does is hang onto the value and > > > > expose it via sysfs. I think I'd rather have a callback in struct > > > > pci_driver and let the driver supply the value when needed. I.e., > > > > sriov_vf_total_msix_show() would call the callback instead of looking > > > > at pdev->sriov->vf_total_msix. > > > > > > It will cause to unnecessary locking to ensure that driver doesn't > > > vanish during sysfs read. I can change, but don't think that it is right > > > decision. > > > > Doesn't sysfs already ensure the driver can't vanish while we're > > executing a DEVICE_ATTR accessor? > > It is not, you can see it by adding device_lock_held() check in any > .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't > do much. This is why pci_vf_set_msix_vec_count() has double lock. Aahh, right, I learned something today, thanks! There are only a few PCI sysfs attributes that reference the driver, and they do their own locking. I do think vf_total_msix is a bit of driver state related to implementing this functionality and doesn't need to be in the PCI core. I think device locking is acceptable; it's very similar to what is done in sriov_numvfs_store(). Doing the locking and calling a driver callback makes it obvious that vf_total_msix is part of this PF driver-specific functionality, not a generic part of the PCI core. So let's give it a try. If it turns out to be terrible, we can revisit it. > > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's > > no actual *reason* for it to be there other than the fact that it has > > "msix" in the name. It uses no MSI data structures. Maybe it could > > be folded into sriov_vf_msix_count_store(), which would make the > > analysis even easier. > > I put _set_ command near _get_ command, but I can move it to iov.c You mean you put pci_vf_set_msix_vec_count() near pci_msix_vec_count()? 
That's *true*, but they are not analogues, and one is not setting the value returned by the other. pci_vf_set_msix_vec_count() is a completely magical thing that uses a device-specific mechanism on a PF that happens to change what pci_msix_vec_count() on a VF will return later. I think this is more related to SR-IOV than it is to MSI. Bjorn > > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com [2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/
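A sketch of the direction agreed on here: the show() takes the device lock and asks the PF driver for the value instead of reading a cached copy. The callback name ->sriov_get_vf_total_msix() is an assumption for illustration; the posted patch does not have such a callback yet.

static ssize_t sriov_vf_total_msix_show(struct device *dev,
					struct device_attribute *attr,
					char *buf)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	u32 vf_total_msix = 0;

	device_lock(dev);	/* keep the PF driver from unbinding underneath us */
	if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
		vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
	device_unlock(dev);

	return sysfs_emit(buf, "%u\n", vf_total_msix);
}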
On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote: > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote: > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > > > This is needed to optimize the performance of newly bound > > > > > > devices by allocating the number of vectors based on the > > > > > > administrator knowledge of targeted VM. > > > > > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > > > have a PF that supports a finite number of MSI-X vectors for use > > > > > by all the VFs, and this interface is to control the distribution > > > > > of those MSI-X vectors among the VFs. > > > > This commit log should describe how *this* device manages this > > > allocation and how the PF Table Size and the VF Table Sizes are > > > related. Per PCIe, there is no necessary connection between them. > > > > There is no connection in mlx5 devices either. PF is used as a vehicle > > to access VF that doesn't have driver yet. From "table size" perspective > > they completely independent, because PF already probed by driver and > > it is already too late to change it. > > > > So PF table size is static and can be changed through FW utility only. > > This is where description of the device would be useful. > > The fact that you need "sriov_vf_total_msix" to advertise how many > vectors are available and "sriov_vf_msix_count" to influence how they > are distributed across the VFs suggests that these Table Sizes are not > completely independent. > > Can a VF have a bigger Table Size than the PF does? Can all the VF > Table Sizes added together be bigger than the PF Table Size? If VF A > has a larger Table Size, does that mean VF B must have a smaller Table > Size? VFs are completely independent devices and their table size can be bigger than PF. FW has two pools, one for PF and another for all VFs. In real world scenarios, every VF will have more MSI-X vectors than PF, which will be distributed by orchestration software. In theory, users can assign almost all vectors to one VF and leave others depleted. In practice, it is not possible, we ensure that all VFs start with some sensible number of vectors and FW protects with check of max vector size write. > > Obviously I do not understand the details about how this device works. > It would be helpful to have those details here. > > Here's the sequence as I understand it: > > 1) PF driver binds to PF > 2) PF driver enables VFs > 3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix > 4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF > 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count > 6) VF driver binds to VF > 7) VF reads MSI-X Message Control (Table Size) > > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read > different Table Sizes? That would be a little weird. Yes, this is the flow. I think differently from you and think this is actual good thing that user writes new msix count and it is shown immediately. > > I'm also a little concerned about doing 2 before 3 & 4. 
That works > for mlx5 because implements the Table Size adjustment in a way that > works *after* the VFs have been enabled. It is not related to mlx5, but to the PCI spec that requires us to create all VFs at the same time. Before enabling VFs, they don't exist. > > But it seems conceivable that a device could implement vector > distribution in a way that would require the VF Table Sizes to be > fixed *before* enabling VFs. That would be nice in the sense that the > VFs would be created "fully formed" and the VF Table Size would be > completely read-only as documented. It is not how SR-IOV is used in real world. Cloud providers create many VFs but don't assign those VFs yet. They do it after customer request VM with specific properties (number of CPUs) which means number of MSI-X vectors in our case. All this is done when other VFs already in use and we can't destroy and recreate them at that point of time. > > The other knob idea you mentioned at [2]: > > echo "0000:01:00.2 123" > sriov_vf_msix_count This knob doesn't always work if you have many devices and it is nightmare to guess "new" VF BDF before it is claimed. As an example on my machine with two devices, VFs are created differently: [root@c ~]$ lspci |grep nox 08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] [root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs [root@c ~]# lspci |grep nox 08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:01.2 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] 08:01.3 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [root@c ~]# echo 0 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs [root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs [root@c ~]# lspci |grep nox 08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] 08:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] > > would have the advantage of working for both cases. That's definitely > more complicated, but at the same time, I would hate to carve a sysfs > interface into stone if it might not work for other devices. As you can see above, it is not really solves pre-creation configuration flow. For such flows, probably devlink is the best choice. > > > > > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > > > +Date: January 2021 > > > > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > > > > +Description: > > > > > > + This file is associated with the SR-IOV PFs. > > > > > > + It returns a total number of possible to configure MSI-X > > > > > > + vectors on the enabled VFs. > > > > > > + > > > > > > + The values returned are: > > > > > > + * > 0 - this will be total number possible to consume by VFs, > > > > > > + * = 0 - feature is not supported > > > > > Not sure the "= 0" description is necessary here. If the value > > > returned is the number of MSI-X vectors available for assignment to > > > VFs, "0" is a perfectly legitimate value. It just means there are > > > none. It doesn't need to be described separately. 
> > > > I wanted to help users and remove ambiguity. For example, mlx5 drivers > > will always implement proper .set_...() callbacks but for some devices > > without needed FW support, the value will be 0. Instead of misleading > > users with wrong promise that feature supported but doesn't have > > available vectors, I decided to be more clear. For the users, 0 means, don't > > try, it is not working. > > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5 > devices"? I totally missed that; I thought you meant "not supported > by the PF driver." Why not have the PF driver detect that the > firmware doesn't support the feature and just not expose the sysfs > files at all in that case? I can do it and will do it, but this behavior will be possible because mlx5 is flexible enough. I imagine that other device won't be so flexible, for example they will decide to "close" this feature because of other SW limitation. > > > > If we put them in a new "vfs_overlay" directory, it seems like > > > overkill to repeat the "vf" part, but I'm hoping the new files can end > > > up next to these existing files. In that case, I think it makes sense > > > to include "sriov". And it probably does make sense to include "vf" > > > as well. > > > > I put everything in folder to group any possible future extensions. > > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve > > separate folder. > > I'm not convinced (yet) that the possibility of future extensions is > enough justification for adding the "vfs_overlay" directory. It > really complicates the code flow -- if we skipped the new directory, > I'm pretty sure we could make .is_visible() work, which would be a > major simplification. Unfortunately not, is_visible() is not dynamic and called when device creates sysfs and it doesn't rescan it or refresh them. It means that all files from vfs_overlay folder will exist even driver is removed from PF. Also sysfs is created before driver is probed. > > And there's quite a bit of value in the new files being right next to > the existing sriov_* files. I have no problem to drop folder, just need clear request, should I. > > > > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev) > > > > > > +{ > > > > > > + struct pci_dev *virtfn; > > > > > > + int id; > > > > > > + > > > > > > + if (!dev->is_physfn || !dev->sriov->num_VFs) > > > > > > + return; > > > > > > + > > > > > > + id = dev->sriov->num_VFs; > > > > > > + while (id--) { > > > > > > + virtfn = pci_get_domain_bus_and_slot( > > > > > > + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), > > > > > > + pci_iov_virtfn_devfn(dev, id)); > > > > > > + > > > > > > + if (!virtfn) > > > > > > + continue; > > > > > > + > > > > > > + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); > > > > > > + } > > > > > > + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); > > > > > > +} > > > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); > > > > > > > > > > I'm not convinced all this sysfs wrangling is necessary. If it is, > > > > > add a hint in a comment about why this is special and can't use > > > > > something like sriov_dev_attr_group. > > > > > > > > This makes the overlay to be PF-driven. Alexander insisted on this flow. > > > > > > If you're referring to [1], I think "insisted on this flow" might be > > > an overstatement of what Alex was looking for. IIUC Alex wants the > > > sysfs files to be visible only when they're useful, i.e., when a > > > driver implements ->sriov_set_msix_vec_count(). 
> > > > It is only one side of the request, the sysfs files shouldn't be visible > > if PF driver was removed and visible again when its probed again. > > I can't parse this, but it's probably related to the question below. > > > > That seems reasonable and also seems like something a smarter > > > .is_visible() function could accomplish without having drivers call > > > pci_enable_vfs_overlay(), e.g., maybe some variation of this: > > > > > > static umode_t sriov_vf_attrs_are_visible(...) > > > { > > > if (!pdev->msix_cap || dev_is_pf(dev)) > > > return 0; > > > > > > pf = pci_physfn(pdev); > > > if (pf->driver && pf->driver->sriov_set_msix_vec_count) > > > return a->mode; > > > > > > return 0; > > > } > > > > It doesn't work with the following flow: > > 1. load driver > > 2. disable autoprobe > > 3. echo to sriov_numvfs > > .... <--- you have this sriov_vf_attrs_are_visible() created > > 4. unload driver > > .... <--- sysfs still exists despite not having PF driver. > > I missed your point here, sorry. After unloading the PF driver, > "pf->driver" in the sketch above will be NULL, so the VF sysfs file > would not be visible. Right? Maybe it has to do with autoprobe? I > didn't catch what the significance of disabling autoprobe was. No, is_visible() is called on file creation only, see fs/sysfs/group.c: create_files(). After you remove driver, there is no sysfs file refresh. > > > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) > > > > > > +{ > > > > > > + if (!dev->is_physfn) > > > > > > + return; > > > > > > + > > > > > > + dev->sriov->vf_total_msix = count; > > > > > > > > > > The PCI core doesn't use vf_total_msix at all. The driver, e.g., > > > > > mlx5, calls this, and all the PCI core does is hang onto the value and > > > > > expose it via sysfs. I think I'd rather have a callback in struct > > > > > pci_driver and let the driver supply the value when needed. I.e., > > > > > sriov_vf_total_msix_show() would call the callback instead of looking > > > > > at pdev->sriov->vf_total_msix. > > > > > > > > It will cause to unnecessary locking to ensure that driver doesn't > > > > vanish during sysfs read. I can change, but don't think that it is right > > > > decision. > > > > > > Doesn't sysfs already ensure the driver can't vanish while we're > > > executing a DEVICE_ATTR accessor? > > > > It is not, you can see it by adding device_lock_held() check in any > > .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't > > do much. This is why pci_vf_set_msix_vec_count() has double lock. > > Aahh, right, I learned something today, thanks! There are only a few > PCI sysfs attributes that reference the driver, and they do their own > locking. > > I do think vf_total_msix is a bit of driver state related to > implementing this functionality and doesn't need to be in the PCI > core. I think device locking is acceptable; it's very similar to what > is done in sriov_numvfs_store(). Doing the locking and calling a > driver callback makes it obvious that vf_total_msix is part of this PF > driver-specific functionality, not a generic part of the PCI core. > > So let's give it a try. If it turns out to be terrible, we can > revisit it. No problem. > > > > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's > > > no actual *reason* for it to be there other than the fact that it has > > > "msix" in the name. It uses no MSI data structures. 
Maybe it could > > > be folded into sriov_vf_msix_count_store(), which would make the > > > analysis even easier. > > > > I put _set_ command near _get_ command, but I can move it to iov.c > > You mean you put pci_vf_set_msix_vec_count() near > pci_msix_vec_count()? That's *true*, but they are not analogues, and > one is not setting the value returned by the other. > > pci_vf_set_msix_vec_count() is a completely magical thing that uses a > device-specific mechanism on a PF that happens to change what > pci_msix_vec_count() on a VF will return later. I think this is more > related to SR-IOV than it is to MSI. I will do. > > Bjorn > > > > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com > > [2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/
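For completeness: the mechanism the core does provide for re-evaluating .is_visible() after group creation is sysfs_update_group(), which re-runs the callback and adds or removes files to match. Whether calling it from the PF driver's probe/remove paths would end up any cleaner than pci_enable_vfs_overlay()/pci_disable_vfs_overlay() is exactly the open question in this thread; the sketch below only illustrates the idea and mirrors the VF walk from the posted patch.

/* Illustration only: re-evaluate the VF attribute group after the PF driver
 * binds or unbinds, so that .is_visible() is consulted again.
 */
static void sriov_refresh_vf_overlay(struct pci_dev *pf)
{
	int i;

	for (i = 0; i < pci_num_vf(pf); i++) {
		struct pci_dev *vf;

		vf = pci_get_domain_bus_and_slot(pci_domain_nr(pf->bus),
						 pci_iov_virtfn_bus(pf, i),
						 pci_iov_virtfn_devfn(pf, i));
		if (!vf)
			continue;

		/* adds files that became visible, removes files that did not */
		sysfs_update_group(&vf->dev.kobj, &sriov_vf_dev_attr_group);
		pci_dev_put(vf);
	}
}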
On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote: > On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote: > > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote: > > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > > > > This is needed to optimize the performance of newly bound > > > > > > > devices by allocating the number of vectors based on the > > > > > > > administrator knowledge of targeted VM. > > > > > > > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > > > > have a PF that supports a finite number of MSI-X vectors for use > > > > > > by all the VFs, and this interface is to control the distribution > > > > > > of those MSI-X vectors among the VFs. > > > > > > This commit log should describe how *this* device manages this > > > > allocation and how the PF Table Size and the VF Table Sizes are > > > > related. Per PCIe, there is no necessary connection between them. > > > > > > There is no connection in mlx5 devices either. PF is used as a vehicle > > > to access VF that doesn't have driver yet. From "table size" perspective > > > they completely independent, because PF already probed by driver and > > > it is already too late to change it. > > > > > > So PF table size is static and can be changed through FW utility only. > > > > This is where description of the device would be useful. > > > > The fact that you need "sriov_vf_total_msix" to advertise how many > > vectors are available and "sriov_vf_msix_count" to influence how they > > are distributed across the VFs suggests that these Table Sizes are not > > completely independent. > > > > Can a VF have a bigger Table Size than the PF does? Can all the VF > > Table Sizes added together be bigger than the PF Table Size? If VF A > > has a larger Table Size, does that mean VF B must have a smaller Table > > Size? > > VFs are completely independent devices and their table size can be > bigger than PF. FW has two pools, one for PF and another for all VFs. > In real world scenarios, every VF will have more MSI-X vectors than PF, > which will be distributed by orchestration software. Well, if the sum of all the VF Table Sizes cannot exceed the size of the FW pool for VFs, I would say the VFs are not completely independent. Increasing the Table Size of one VF reduces it for other VFs. This is an essential detail because it's the whole reason behind this interface, so sketching this out in the commit log will make this much easier to understand. > > Here's the sequence as I understand it: > > > > 1) PF driver binds to PF > > 2) PF driver enables VFs > > 3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix > > 4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF > > 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count > > 6) VF driver binds to VF > > 7) VF reads MSI-X Message Control (Table Size) > > > > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read > > different Table Sizes? That would be a little weird. > > Yes, this is the flow. 
I think differently from you and think this > is actual good thing that user writes new msix count and it is shown > immediately. Only weird because per spec Table Size is read-only and in this scenario it changes, so it may be surprising, but probably not a huge deal. > > I'm also a little concerned about doing 2 before 3 & 4. That works > > for mlx5 because implements the Table Size adjustment in a way that > > works *after* the VFs have been enabled. > > It is not related to mlx5, but to the PCI spec that requires us to > create all VFs at the same time. Before enabling VFs, they don't > exist. Yes. I can imagine a PF driver that collects characteristics for the desired VFs before enabling them, sort of like we already collect the *number* of VFs. But I think your argument for dynamic changes after creation below is pretty compelling. > > But it seems conceivable that a device could implement vector > > distribution in a way that would require the VF Table Sizes to be > > fixed *before* enabling VFs. That would be nice in the sense that the > > VFs would be created "fully formed" and the VF Table Size would be > > completely read-only as documented. > > It is not how SR-IOV is used in real world. Cloud providers create many > VFs but don't assign those VFs yet. They do it after customer request > VM with specific properties (number of CPUs) which means number of MSI-X > vectors in our case. > > All this is done when other VFs already in use and we can't destroy and > recreate them at that point of time. Makes sense, thanks for this insight. The need to change the MSI-X Table Size dynamically while other VFs are in use would also be useful in the commit log because it really helps motivate this design. > > > > > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > > > > +Date: January 2021 > > > > > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > > > > > +Description: > > > > > > > + This file is associated with the SR-IOV PFs. > > > > > > > + It returns a total number of possible to configure MSI-X > > > > > > > + vectors on the enabled VFs. > > > > > > > + > > > > > > > + The values returned are: > > > > > > > + * > 0 - this will be total number possible to consume by VFs, > > > > > > > + * = 0 - feature is not supported > > > > > > > Not sure the "= 0" description is necessary here. If the value > > > > returned is the number of MSI-X vectors available for assignment to > > > > VFs, "0" is a perfectly legitimate value. It just means there are > > > > none. It doesn't need to be described separately. > > > > > > I wanted to help users and remove ambiguity. For example, mlx5 drivers > > > will always implement proper .set_...() callbacks but for some devices > > > without needed FW support, the value will be 0. Instead of misleading > > > users with wrong promise that feature supported but doesn't have > > > available vectors, I decided to be more clear. For the users, 0 means, don't > > > try, it is not working. > > > > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5 > > devices"? I totally missed that; I thought you meant "not supported > > by the PF driver." Why not have the PF driver detect that the > > firmware doesn't support the feature and just not expose the sysfs > > files at all in that case? > > I can do it and will do it, but this behavior will be possible because > mlx5 is flexible enough. 
I imagine that other device won't be so flexible, > for example they will decide to "close" this feature because of other > SW limitation. What about something like this? This file contains the total number of MSI-X vectors available for assignment to all VFs associated with this PF. It may be zero if the device doesn't support this functionality. Side question: How does user-space use this? This file contains a constant, and there's no interface to learn how many vectors are still available in the pool, right? I guess the user just has to manage the values written to .../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds sriov_vf_total_msix? If that user app crashes, I guess it can reconstruct this state by reading all the .../VF<n>/sriov_vf_msix_count files? > > > > If we put them in a new "vfs_overlay" directory, it seems like > > > > overkill to repeat the "vf" part, but I'm hoping the new files can end > > > > up next to these existing files. In that case, I think it makes sense > > > > to include "sriov". And it probably does make sense to include "vf" > > > > as well. > > > > > > I put everything in folder to group any possible future extensions. > > > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve > > > separate folder. > > > > I'm not convinced (yet) that the possibility of future extensions is > > enough justification for adding the "vfs_overlay" directory. It > > really complicates the code flow -- if we skipped the new directory, > > I'm pretty sure we could make .is_visible() work, which would be a > > major simplification. > > Unfortunately not, is_visible() is not dynamic and called when device > creates sysfs and it doesn't rescan it or refresh them. It means that > all files from vfs_overlay folder will exist even driver is removed > from PF. > > Also sysfs is created before driver is probed. Ah, both excellent points, that makes this much clearer to me, thank you! > > And there's quite a bit of value in the new files being right next to > > the existing sriov_* files. > > I have no problem to drop folder, just need clear request, should I. I think we should drop the folder. I missed the previous discussion though, so if somebody has a good objection to dropping it, let me know. Dropping the folder has the potential advantage that we *could* decide to expose these files always, and just have them be non-functional if the device doesn't support assignment or the PF driver isn't bound. Then I think we could use static sysfs attributes without pci_enable_vfs_overlay(). Bjorn
On Fri, Feb 05, 2021 at 04:57:03PM -0600, Bjorn Helgaas wrote: > On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote: > > On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote: > > > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote: > > > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote: > > > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote: > > > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote: > > > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote: > > > > > > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > > > > > > > > > > > Extend PCI sysfs interface with a new callback that allows > > > > > > > > configure the number of MSI-X vectors for specific SR-IO VF. > > > > > > > > This is needed to optimize the performance of newly bound > > > > > > > > devices by allocating the number of vectors based on the > > > > > > > > administrator knowledge of targeted VM. > > > > > > > > > > > > > > I'm reading between the lines here, but IIUC the point is that you > > > > > > > have a PF that supports a finite number of MSI-X vectors for use > > > > > > > by all the VFs, and this interface is to control the distribution > > > > > > > of those MSI-X vectors among the VFs. > > > > > > > > This commit log should describe how *this* device manages this > > > > > allocation and how the PF Table Size and the VF Table Sizes are > > > > > related. Per PCIe, there is no necessary connection between them. > > > > > > > > There is no connection in mlx5 devices either. PF is used as a vehicle > > > > to access VF that doesn't have driver yet. From "table size" perspective > > > > they completely independent, because PF already probed by driver and > > > > it is already too late to change it. > > > > > > > > So PF table size is static and can be changed through FW utility only. > > > > > > This is where description of the device would be useful. > > > > > > The fact that you need "sriov_vf_total_msix" to advertise how many > > > vectors are available and "sriov_vf_msix_count" to influence how they > > > are distributed across the VFs suggests that these Table Sizes are not > > > completely independent. > > > > > > Can a VF have a bigger Table Size than the PF does? Can all the VF > > > Table Sizes added together be bigger than the PF Table Size? If VF A > > > has a larger Table Size, does that mean VF B must have a smaller Table > > > Size? > > > > VFs are completely independent devices and their table size can be > > bigger than PF. FW has two pools, one for PF and another for all VFs. > > In real world scenarios, every VF will have more MSI-X vectors than PF, > > which will be distributed by orchestration software. > > Well, if the sum of all the VF Table Sizes cannot exceed the size of > the FW pool for VFs, I would say the VFs are not completely > independent. Increasing the Table Size of one VF reduces it for other > VFs. > > This is an essential detail because it's the whole reason behind this > interface, so sketching this out in the commit log will make this much > easier to understand. I think that it is already written, but will recheck. 
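To make the shared-pool constraint concrete (purely illustrative, not code from the patch or the mlx5 driver): whatever does the distribution effectively has to keep the sum of the per-VF counts within the advertised total, so giving one VF more vectors leaves fewer for the others.

/* Hypothetical helper: would a new count for VF @which still fit in the pool? */
static bool vf_msix_request_fits(u32 total_pool, const u32 *vf_count,
				 int num_vfs, int which, u32 new_count)
{
	u64 sum = 0;
	int i;

	for (i = 0; i < num_vfs; i++)
		sum += (i == which) ? new_count : vf_count[i];

	return sum <= total_pool;
}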
> > > > Here's the sequence as I understand it: > > > > > > 1) PF driver binds to PF > > > 2) PF driver enables VFs > > > 3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix > > > 4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF > > > 5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count > > > 6) VF driver binds to VF > > > 7) VF reads MSI-X Message Control (Table Size) > > > > > > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read > > > different Table Sizes? That would be a little weird. > > > > Yes, this is the flow. I think differently from you and think this > > is actual good thing that user writes new msix count and it is shown > > immediately. > > Only weird because per spec Table Size is read-only and in this > scenario it changes, so it may be surprising, but probably not a huge > deal. > > > > I'm also a little concerned about doing 2 before 3 & 4. That works > > > for mlx5 because implements the Table Size adjustment in a way that > > > works *after* the VFs have been enabled. > > > > It is not related to mlx5, but to the PCI spec that requires us to > > create all VFs at the same time. Before enabling VFs, they don't > > exist. > > Yes. I can imagine a PF driver that collects characteristics for the > desired VFs before enabling them, sort of like we already collect the > *number* of VFs. But I think your argument for dynamic changes after > creation below is pretty compelling. IMHO, the best tool for such pre-configured changes will be devlink. > > > > But it seems conceivable that a device could implement vector > > > distribution in a way that would require the VF Table Sizes to be > > > fixed *before* enabling VFs. That would be nice in the sense that the > > > VFs would be created "fully formed" and the VF Table Size would be > > > completely read-only as documented. > > > > It is not how SR-IOV is used in real world. Cloud providers create many > > VFs but don't assign those VFs yet. They do it after customer request > > VM with specific properties (number of CPUs) which means number of MSI-X > > vectors in our case. > > > > All this is done when other VFs already in use and we can't destroy and > > recreate them at that point of time. > > Makes sense, thanks for this insight. The need to change the MSI-X > Table Size dynamically while other VFs are in use would also be useful > in the commit log because it really helps motivate this design. It is already written, but I will add more info. > > > > > > > > > +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix > > > > > > > > +Date: January 2021 > > > > > > > > +Contact: Leon Romanovsky <leonro@nvidia.com> > > > > > > > > +Description: > > > > > > > > + This file is associated with the SR-IOV PFs. > > > > > > > > + It returns a total number of possible to configure MSI-X > > > > > > > > + vectors on the enabled VFs. > > > > > > > > + > > > > > > > > + The values returned are: > > > > > > > > + * > 0 - this will be total number possible to consume by VFs, > > > > > > > > + * = 0 - feature is not supported > > > > > > > > > Not sure the "= 0" description is necessary here. If the value > > > > > returned is the number of MSI-X vectors available for assignment to > > > > > VFs, "0" is a perfectly legitimate value. It just means there are > > > > > none. It doesn't need to be described separately. > > > > > > > > I wanted to help users and remove ambiguity. 
For example, mlx5 drivers > > > > will always implement proper .set_...() callbacks but for some devices > > > > without needed FW support, the value will be 0. Instead of misleading > > > > users with wrong promise that feature supported but doesn't have > > > > available vectors, I decided to be more clear. For the users, 0 means, don't > > > > try, it is not working. > > > > > > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5 > > > devices"? I totally missed that; I thought you meant "not supported > > > by the PF driver." Why not have the PF driver detect that the > > > firmware doesn't support the feature and just not expose the sysfs > > > files at all in that case? > > > > I can do it and will do it, but this behavior will be possible because > > mlx5 is flexible enough. I imagine that other device won't be so flexible, > > for example they will decide to "close" this feature because of other > > SW limitation. > > What about something like this? > > This file contains the total number of MSI-X vectors available for > assignment to all VFs associated with this PF. It may be zero if > the device doesn't support this functionality. Sounds good. > > Side question: How does user-space use this? This file contains a > constant, and there's no interface to learn how many vectors are still > available in the pool, right? Right, it is easy for the kernel implementation and easy for the users. They don't need the kernel to report the exact number of vectors already in use. Every VF has its vectors assigned from the beginning, and in our flow there is no parallel race between the kernel and user space to claim resources, so by the time the orchestration software gets a chance to run, the system is already in a steady state. Writing to these sysfs files on the hypervisor requires root permissions, and the management software already tracks the number of users, their CPUs and the number of vectors. > > I guess the user just has to manage the values written to > .../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds > sriov_vf_total_msix? If that user app crashes, I guess it can > reconstruct this state by reading all the > .../VF<n>/sriov_vf_msix_count files? Yes, if the orchestration software crashes on the hypervisor, it will simply iterate over all the VFs again from the beginning. > > > > > > If we put them in a new "vfs_overlay" directory, it seems like > > > > > overkill to repeat the "vf" part, but I'm hoping the new files can end > > > > > up next to these existing files. In that case, I think it makes sense > > > > > to include "sriov". And it probably does make sense to include "vf" > > > > > as well. > > > > > > > > I put everything in folder to group any possible future extensions. > > > > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve > > > > separate folder. > > > > > > I'm not convinced (yet) that the possibility of future extensions is > > > enough justification for adding the "vfs_overlay" directory. It > > > really complicates the code flow -- if we skipped the new directory, > > > I'm pretty sure we could make .is_visible() work, which would be a > > > major simplification. > > > > Unfortunately not, is_visible() is not dynamic and called when device > > creates sysfs and it doesn't rescan it or refresh them. It means that > > all files from vfs_overlay folder will exist even driver is removed > > from PF. > > > > Also sysfs is created before driver is probed. > > Ah, both excellent points, that makes this much clearer to me, thank > you! 
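A hedged sketch of the recovery step described here: after a crash, the orchestration software rebuilds its view of the pool by re-reading every VF's sriov_vf_msix_count. The device addresses and the number of VFs are illustrative assumptions (a real tool would resolve the VFs through the virtfnN links under the PF), and the paths follow the patch as posted, including the vfs_overlay directory that may yet be dropped:

#include <stdio.h>

/* Read a single integer from a sysfs file; returns -1 on any failure. */
static int read_sysfs_int(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	int total = read_sysfs_int(
		"/sys/bus/pci/devices/0000:08:00.0/vfs_overlay/sriov_vf_total_msix");
	int used = 0;
	char path[128];
	int i;

	for (i = 1; i <= 5; i++) {	/* assume five VFs at 0000:08:00.1..5 */
		int count;

		snprintf(path, sizeof(path),
			 "/sys/bus/pci/devices/0000:08:00.%d/vfs_overlay/sriov_vf_msix_count",
			 i);
		count = read_sysfs_int(path);
		if (count > 0)
			used += count;
	}
	printf("pool: %d, assigned: %d, free: %d\n", total, used, total - used);
	return 0;
}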
> > > > And there's quite a bit of value in the new files being right next to > > > the existing sriov_* files. > > > > I have no problem to drop folder, just need clear request, should I. > > I think we should drop the folder. I missed the previous discussion > though, so if somebody has a good objection to dropping it, let me > know. Just to be clear, dropping the "vfs_overlay" folder won't remove any of the complexity around pci_enable_vfs_overlay()/pci_disable_vfs_overlay(); it will simply change the name attribute in the attribute_group. > > Dropping the folder has the potential advantage that we *could* decide > to expose these files always, and just have them be non-functional if > the device doesn't support assignment or the PF driver isn't bound. > Then I think we could use static sysfs attributes without > pci_enable_vfs_overlay(). Such static initialization was used in the v0..v3 versions, but the request was for the new files to be PF-driven, and that requires the enable/disable callbacks. They need to support the flow where the PF driver is bound after the VF devices have already been created. For example this flow: 1. modprobe mlx5_core 2. echo 5 > .../sriov_numvfs 3. rmmod mlx5_core ... 4. modprobe mlx5_core <- needs to reattach to the already existing VFs. Thanks > > Bjorn
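To make the PF-driven flow above concrete, here is a hedged sketch of how a PF driver might wire up the interfaces added by the patch below — pci_sriov_set_vf_total_msix(), pci_enable_vfs_overlay()/pci_disable_vfs_overlay() and the .sriov_set_msix_vec_count callback. The my_fw_*() helpers stand in for device-specific firmware commands and are assumptions, not real APIs; this is not the mlx5 implementation:

#include <linux/pci.h>

/* Hypothetical device-specific firmware hooks (assumptions for the sketch). */
u32 my_fw_query_vf_msix_pool(struct pci_dev *pf);
int my_fw_set_vf_msix(struct pci_dev *pf, struct pci_dev *vf, int count);

/* Called while no driver is bound to @vf; resizes its MSI-X table in FW. */
static int my_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
{
	return my_fw_set_vf_msix(pci_physfn(vf), vf, msix_vec_count);
}

static int my_sriov_configure(struct pci_dev *pf, int num_vfs)
{
	int ret;

	if (!num_vfs) {
		pci_disable_vfs_overlay(pf);
		pci_disable_sriov(pf);
		return 0;
	}

	ret = pci_enable_sriov(pf, num_vfs);
	if (ret)
		return ret;

	/* Advertise the FW pool, then expose the per-VF sysfs knobs. */
	pci_sriov_set_vf_total_msix(pf, my_fw_query_vf_msix_pool(pf));
	ret = pci_enable_vfs_overlay(pf);
	if (ret) {
		pci_disable_sriov(pf);
		return ret;
	}
	return num_vfs;
}

/*
 * probe() would also call pci_enable_vfs_overlay() when it finds VFs already
 * enabled, covering the modprobe -> rmmod -> modprobe sequence above.
 */
static struct pci_driver my_pf_driver = {
	/* .name, .id_table, .probe, ... */
	.sriov_configure	  = my_sriov_configure,
	.sriov_set_msix_vec_count = my_sriov_set_msix_vec_count,
};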
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index 25c9c39770c6..4d206ade5331 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -375,3 +375,35 @@ Description: The value comes from the PCI kernel device state and can be one of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold". The file is read only. + +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count +Date: January 2021 +Contact: Leon Romanovsky <leonro@nvidia.com> +Description: + This file is associated with the SR-IOV VFs. + It allows configuration of the number of MSI-X vectors for + the VF. This is needed to optimize performance of newly bound + devices by allocating the number of vectors based on the + administrator knowledge of targeted VM. + + The values accepted are: + * > 0 - this will be number reported by the VF's MSI-X + capability + * < 0 - not valid + * = 0 - will reset to the device default value + + The file is writable if the PF is bound to a driver that + set sriov_vf_total_msix > 0 and there is no driver bound + to the VF. + +What: /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix +Date: January 2021 +Contact: Leon Romanovsky <leonro@nvidia.com> +Description: + This file is associated with the SR-IOV PFs. + It returns a total number of possible to configure MSI-X + vectors on the enabled VFs. + + The values returned are: + * > 0 - this will be total number possible to consume by VFs, + * = 0 - feature is not supported diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 4afd4ee4f7f0..3e95f835eba5 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) return (dev->devfn + dev->sriov->offset + dev->sriov->stride * vf_id) & 0xff; } +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn); /* * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev, return rc; } +#ifdef CONFIG_PCI_MSI +static ssize_t sriov_vf_msix_count_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + int count = pci_msix_vec_count(pdev); + + if (count < 0) + return count; + + return sysfs_emit(buf, "%d\n", count); +} + +static ssize_t sriov_vf_msix_count_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct pci_dev *vf_dev = to_pci_dev(dev); + int val, ret; + + ret = kstrtoint(buf, 0, &val); + if (ret) + return ret; + + ret = pci_vf_set_msix_vec_count(vf_dev, val); + if (ret) + return ret; + + return count; +} +static DEVICE_ATTR_RW(sriov_vf_msix_count); + +static ssize_t sriov_vf_total_msix_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix); +} +static DEVICE_ATTR_RO(sriov_vf_total_msix); +#endif + +static struct attribute *sriov_pf_dev_attrs[] = { +#ifdef CONFIG_PCI_MSI + &dev_attr_sriov_vf_total_msix.attr, +#endif + NULL, +}; + +static struct attribute *sriov_vf_dev_attrs[] = { +#ifdef CONFIG_PCI_MSI + &dev_attr_sriov_vf_msix_count.attr, +#endif + NULL, +}; + +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj, + struct attribute *a, int n) +{ + struct device *dev = kobj_to_dev(kobj); + struct pci_dev *pdev = to_pci_dev(dev); + + if (!pdev->msix_cap || !dev_is_pf(dev)) + return 0; + + return a->mode; +} + +static umode_t 
sriov_vf_attrs_are_visible(struct kobject *kobj, + struct attribute *a, int n) +{ + struct device *dev = kobj_to_dev(kobj); + struct pci_dev *pdev = to_pci_dev(dev); + + if (!pdev->msix_cap || dev_is_pf(dev)) + return 0; + + return a->mode; +} + +static const struct attribute_group sriov_pf_dev_attr_group = { + .attrs = sriov_pf_dev_attrs, + .is_visible = sriov_pf_attrs_are_visible, + .name = "vfs_overlay", +}; + +static const struct attribute_group sriov_vf_dev_attr_group = { + .attrs = sriov_vf_dev_attrs, + .is_visible = sriov_vf_attrs_are_visible, + .name = "vfs_overlay", +}; + +int pci_enable_vfs_overlay(struct pci_dev *dev) +{ + struct pci_dev *virtfn; + int id, ret; + + if (!dev->is_physfn || !dev->sriov->num_VFs) + return 0; + + ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); + if (ret) + return ret; + + for (id = 0; id < dev->sriov->num_VFs; id++) { + virtfn = pci_get_domain_bus_and_slot( + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), + pci_iov_virtfn_devfn(dev, id)); + + if (!virtfn) + continue; + + ret = sysfs_create_group(&virtfn->dev.kobj, + &sriov_vf_dev_attr_group); + if (ret) + goto out; + } + return 0; + +out: + while (id--) { + virtfn = pci_get_domain_bus_and_slot( + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), + pci_iov_virtfn_devfn(dev, id)); + + if (!virtfn) + continue; + + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); + } + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); + return ret; +} +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay); + +void pci_disable_vfs_overlay(struct pci_dev *dev) +{ + struct pci_dev *virtfn; + int id; + + if (!dev->is_physfn || !dev->sriov->num_VFs) + return; + + id = dev->sriov->num_VFs; + while (id--) { + virtfn = pci_get_domain_bus_and_slot( + pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id), + pci_iov_virtfn_devfn(dev, id)); + + if (!virtfn) + continue; + + sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group); + } + sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group); +} +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay); + int pci_iov_add_virtfn(struct pci_dev *dev, int id) { int i; @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev) sysfs_remove_link(&dev->dev.kobj, "dep_link"); iov->num_VFs = 0; + iov->vf_total_msix = 0; pci_iov_set_numvfs(dev, 0); } @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev) } EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs); +/** + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs + * @dev: the PCI PF device + * @count: the total number of MSI-X vector to consume by the VFs + * + * Sets the number of MSI-X vectors that is possible to consume by the VFs. + * This interface is complimentary part of the pci_vf_set_msix_vec_count() + * that will be used to configure the required number on the VF. 
+ */ +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) +{ + if (!dev->is_physfn) + return; + + dev->sriov->vf_total_msix = count; +} +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix); + /** * pci_sriov_configure_simple - helper to configure SR-IOV * @dev: the PCI device diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 3162f88fe940..1022fe9e6efd 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev) } EXPORT_SYMBOL(pci_msix_vec_count); +/** + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors + * This function is applicable for SR-IOV VF because such devices allocate + * their MSI-X table before they will run on the VMs and HW can't guess the + * right number of vectors, so the HW allocates them statically and equally. + * @dev: VF device that is going to be changed + * @count: amount of MSI-X vectors + **/ +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count) +{ + struct pci_dev *pdev = pci_physfn(dev); + int ret; + + if (count < 0) + /* + * We don't support negative numbers for now, + * but maybe in the future it will make sense. + */ + return -EINVAL; + + device_lock(&pdev->dev); + if (!pdev->driver) { + ret = -EOPNOTSUPP; + goto err_pdev; + } + + device_lock(&dev->dev); + if (dev->driver) { + /* + * Driver already probed this VF and configured itself + * based on previously configured (or default) MSI-X vector + * count. It is too late to change this field for this + * specific VF. + */ + ret = -EBUSY; + goto err_dev; + } + + ret = pdev->driver->sriov_set_msix_vec_count(dev, count); + +err_dev: + device_unlock(&dev->dev); +err_pdev: + device_unlock(&pdev->dev); + return ret; +} + static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec, struct irq_affinity *affd, int flags) { diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 5c59365092fa..2bd6560d91e2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay; #ifdef CONFIG_PCI_MSI void pci_no_msi(void); +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count); #else static inline void pci_no_msi(void) { } #endif @@ -326,6 +327,9 @@ struct pci_sriov { u16 subsystem_device; /* VF subsystem device */ resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ bool drivers_autoprobe; /* Auto probing of VFs by driver */ + u32 vf_total_msix; /* Total number of MSI-X vectors the VFs + * can consume + */ }; /** diff --git a/include/linux/pci.h b/include/linux/pci.h index b32126d26997..24d118ad6e7b 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -856,6 +856,8 @@ struct module; * e.g. drivers/net/e100.c. * @sriov_configure: Optional driver callback to allow configuration of * number of VFs to enable via sysfs "sriov_numvfs" file. + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors + * exposed by the sysfs "vf_msix_vec" entry. * @err_handler: See Documentation/PCI/pci-error-recovery.rst * @groups: Sysfs attribute groups. * @driver: Driver model structure. 
@@ -871,6 +873,7 @@ struct pci_driver { int (*resume)(struct pci_dev *dev); /* Device woken up */ void (*shutdown)(struct pci_dev *dev); int (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */ + int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */ const struct pci_error_handlers *err_handler; const struct attribute_group **groups; struct device_driver driver; @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar); int pci_iov_virtfn_bus(struct pci_dev *dev, int id); int pci_iov_virtfn_devfn(struct pci_dev *dev, int id); +int pci_enable_vfs_overlay(struct pci_dev *dev); +void pci_disable_vfs_overlay(struct pci_dev *dev); + int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); void pci_disable_sriov(struct pci_dev *dev); @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev); int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn); resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe); +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count); /* Arch may override these (weak) */ int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs); @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id) } static inline void pci_iov_remove_virtfn(struct pci_dev *dev, int id) { } +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; } +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {} static inline void pci_disable_sriov(struct pci_dev *dev) { } static inline int pci_num_vf(struct pci_dev *dev) { return 0; } static inline int pci_vfs_assigned(struct pci_dev *dev) @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev) static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno) { return 0; } static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { } +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {} #endif #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
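For completeness, a hedged user-space sketch of the write side of the interface added above: assigning 16 MSI-X vectors to one VF before any driver is bound to it. The VF address is an illustrative assumption; per the code above, the write fails with EBUSY once a VF driver is bound and with EOPNOTSUPP when no PF driver is bound:

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path =
		"/sys/bus/pci/devices/0000:08:00.2/vfs_overlay/sriov_vf_msix_count";
	FILE *f = fopen(path, "w");
	int ret;

	if (!f) {
		perror("open");
		return 1;
	}
	ret = fprintf(f, "16\n");	/* writing 0 restores the device default */
	/* stdio buffers the value, so the kernel's verdict surfaces at fclose() */
	if (fclose(f) == EOF || ret < 0) {
		fprintf(stderr, "write failed: %s\n", strerror(errno));
		return 1;
	}
	return 0;
}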