Message ID | 20220425092615.10133-6-abhsahu@nvidia.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | vfio/pci: power management changes | expand |
On Mon, 25 Apr 2022 14:56:12 +0530 Abhishek Sahu <abhsahu@nvidia.com> wrote: > Currently, there is very limited power management support > available in the upstream vfio_pci_core based drivers. If there > are no users of the device, then the PCI device will be moved into > D3hot state by writing directly into PCI PM registers. This D3hot > state help in saving power but we can achieve zero power consumption > if we go into the D3cold state. The D3cold state cannot be possible > with native PCI PM. It requires interaction with platform firmware > which is system-specific. To go into low power states (including D3cold), > the runtime PM framework can be used which internally interacts with PCI > and platform firmware and puts the device into the lowest possible > D-States. > > This patch registers vfio_pci_core based drivers with the > runtime PM framework. > > 1. The PCI core framework takes care of most of the runtime PM > related things. For enabling the runtime PM, the PCI driver needs to > decrement the usage count and needs to provide 'struct dev_pm_ops' > at least. The runtime suspend/resume callbacks are optional and needed > only if we need to do any extra handling. Now there are multiple > vfio_pci_core based drivers. Instead of assigning the > 'struct dev_pm_ops' in individual parent driver, the vfio_pci_core > itself assigns the 'struct dev_pm_ops'. There are other drivers where > the 'struct dev_pm_ops' is being assigned inside core layer > (For example, wlcore_probe() and some sound based driver, etc.). > > 2. This patch provides the stub implementation of 'struct dev_pm_ops'. > The subsequent patch will provide the runtime suspend/resume > callbacks. All the config state saving, and PCI power management > related things will be done by PCI core framework itself inside its > runtime suspend/resume callbacks (pci_pm_runtime_suspend() and > pci_pm_runtime_resume()). > > 3. Inside pci_reset_bus(), all the devices in dev_set needs to be > runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take > care of the runtime resume and its error handling. > > 4. Inside vfio_pci_core_disable(), the device usage count always needs > to be decremented which was incremented in vfio_pci_core_enable(). > > 5. Since the runtime PM framework will provide the same functionality, > so directly writing into PCI PM config register can be replaced with > the use of runtime PM routines. Also, the use of runtime PM can help > us in more power saving. > > In the systems which do not support D3cold, > > With the existing implementation: > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3hot > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D0 > > With runtime PM: > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3hot > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D3hot > > So, with runtime PM, the upstream bridge or root port will also go > into lower power state which is not possible with existing > implementation. > > In the systems which support D3cold, > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3hot > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D0 > > With runtime PM: > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3cold > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D3cold > > So, with runtime PM, both the PCI device and upstream bridge will > go into D3cold state. > > 6. If 'disable_idle_d3' module parameter is set, then also the runtime > PM will be enabled, but in this case, the usage count should not be > decremented. > > 7. vfio_pci_dev_set_try_reset() return value is unused now, so this > function return type can be changed to void. > > 8. Use the runtime PM API's in vfio_pci_core_sriov_configure(). > For preventing any runtime usage mismatch, pci_num_vf() has been > called explicitly during disable. > > Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++---------- > 1 file changed, 114 insertions(+), 55 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 953ac33b2f5f..aee5e0cd6137 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) > } > > struct vfio_pci_group_info; > -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); > +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); > static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > struct vfio_pci_group_info *groups); > > @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat > return ret; > } > > +#ifdef CONFIG_PM > +/* > + * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, > + * so use structure without any callbacks. > + * > + * The pci-driver core runtime PM routines always save the device state > + * before going into suspended state. If the device is going into low power > + * state with only with runtime PM ops, then no explicit handling is needed > + * for the devices which have NoSoftRst-. > + */ > +static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; > +#endif > + > int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > { > struct pci_dev *pdev = vdev->pdev; > @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > u16 cmd; > u8 msix_pos; > > - vfio_pci_set_power_state(vdev, PCI_D0); > + if (!disable_idle_d3) { > + ret = pm_runtime_resume_and_get(&pdev->dev); > + if (ret < 0) > + return ret; > + } > > /* Don't allow our initial saved state to include busmaster */ > pci_clear_master(pdev); > > ret = pci_enable_device(pdev); > if (ret) > - return ret; > + goto out_power; > > /* If reset fails because of the device lock, fail this path entirely */ > ret = pci_try_reset_function(pdev); > - if (ret == -EAGAIN) { > - pci_disable_device(pdev); > - return ret; > - } > + if (ret == -EAGAIN) > + goto out_disable_device; > > vdev->reset_works = !ret; > pci_save_state(pdev); > @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > } > > ret = vfio_config_init(vdev); > - if (ret) { > - kfree(vdev->pci_saved_state); > - vdev->pci_saved_state = NULL; > - pci_disable_device(pdev); > - return ret; > - } > + if (ret) > + goto out_free_state; > > msix_pos = pdev->msix_cap; > if (msix_pos) { > @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > > > return 0; > + > +out_free_state: > + kfree(vdev->pci_saved_state); > + vdev->pci_saved_state = NULL; > +out_disable_device: > + pci_disable_device(pdev); > +out_power: > + if (!disable_idle_d3) > + pm_runtime_put(&pdev->dev); > + return ret; > } > EXPORT_SYMBOL_GPL(vfio_pci_core_enable); > > @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) > out: > pci_disable_device(pdev); > > - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3) > - vfio_pci_set_power_state(vdev, PCI_D3hot); > + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set); > + > + /* Put the pm-runtime usage counter acquired during enable */ > + if (!disable_idle_d3) > + pm_runtime_put(&pdev->dev); > } > EXPORT_SYMBOL_GPL(vfio_pci_core_disable); > > @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, > > vfio_pci_probe_power_state(vdev); > > - if (!disable_idle_d3) { > - /* > - * pci-core sets the device power state to an unknown value at > - * bootup and after being removed from a driver. The only > - * transition it allows from this unknown state is to D0, which > - * typically happens when a driver calls pci_enable_device(). > - * We're not ready to enable the device yet, but we do want to > - * be able to get to D3. Therefore first do a D0 transition > - * before going to D3. > - */ > - vfio_pci_set_power_state(vdev, PCI_D0); > - vfio_pci_set_power_state(vdev, PCI_D3hot); > - } > + /* > + * pci-core sets the device power state to an unknown value at > + * bootup and after being removed from a driver. The only > + * transition it allows from this unknown state is to D0, which > + * typically happens when a driver calls pci_enable_device(). > + * We're not ready to enable the device yet, but we do want to > + * be able to get to D3. Therefore first do a D0 transition > + * before enabling runtime PM. > + */ > + vfio_pci_set_power_state(vdev, PCI_D0); > + > +#if defined(CONFIG_PM) > + dev->driver->pm = &vfio_pci_core_pm_ops, > +#endif > + > + pm_runtime_allow(dev); > + if (!disable_idle_d3) > + pm_runtime_put(dev); > > ret = vfio_register_group_dev(&vdev->vdev); > if (ret) > @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, > > out_power: > if (!disable_idle_d3) > - vfio_pci_set_power_state(vdev, PCI_D0); > + pm_runtime_get_noresume(dev); > + > + pm_runtime_forbid(dev); > out_vf: > vfio_pci_vf_uninit(vdev); > out_drvdata: > @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev) > vfio_pci_vga_uninit(vdev); > > if (!disable_idle_d3) > - vfio_pci_set_power_state(vdev, PCI_D0); > + pm_runtime_get_noresume(dev); > > + pm_runtime_forbid(dev); > dev_set_drvdata(dev, NULL); > } > EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device); > @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) > > /* > * The PF power state should always be higher than the VF power > - * state. If PF is in the low power state, then change the > - * power state to D0 first before enabling SR-IOV. > + * state. If PF is in the runtime suspended state, then resume > + * it first before enabling SR-IOV. > */ > - vfio_pci_set_power_state(vdev, PCI_D0); > - ret = pci_enable_sriov(pdev, nr_virtfn); > + ret = pm_runtime_resume_and_get(&pdev->dev); > if (ret) > goto out_del; > + > + ret = pci_enable_sriov(pdev, nr_virtfn); > + if (ret) { > + pm_runtime_put(&pdev->dev); > + goto out_del; > + } > ret = nr_virtfn; > goto out_put; > } > > - pci_disable_sriov(pdev); > + if (pci_num_vf(pdev)) { > + pci_disable_sriov(pdev); > + pm_runtime_put(&pdev->dev); > + } > > out_del: > mutex_lock(&vfio_pci_sriov_pfs_mutex); > @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set) > return pdev; > } > > +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set) > +{ > + struct vfio_pci_core_device *cur_pm; > + struct vfio_pci_core_device *cur; > + int ret = 0; > + > + list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) { > + ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev); > + if (ret < 0) > + break; > + } > + > + if (!ret) > + return 0; > + > + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { > + if (cur == cur_pm) > + break; > + pm_runtime_put(&cur->pdev->dev); > + } > + > + return ret; > +} The above works, but maybe could be a little cleaner taking advantage of list_for_each_entry_continue_reverse as: { struct vfio_pci_core_device *cur; int ret; list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { ret = pm_runtime_resume_and_get(&cur->pdev->dev); if (ret) goto unwind; } return 0; unwind: list_for_each_entry_continue_reverse(cur, &dev_set->device_list, vdev.dev_set_list) pm_runtime_put(&cur->pdev->dev); return ret; } Thanks, Alex > + > /* > * We need to get memory_lock for each device, but devices can share mmap_lock, > * therefore we need to zap and hold the vma_lock for each device, and only then > @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set) > * - At least one of the affected devices is marked dirty via > * needs_reset (such as by lack of FLR support) > * Then attempt to perform that bus or slot reset. > - * Returns true if the dev_set was reset. > */ > -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) > +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) > { > struct vfio_pci_core_device *cur; > struct pci_dev *pdev; > - int ret; > + bool reset_done = false; > > if (!vfio_pci_dev_set_needs_reset(dev_set)) > - return false; > + return; > > pdev = vfio_pci_dev_set_resettable(dev_set); > if (!pdev) > - return false; > + return; > > /* > - * The pci_reset_bus() will reset all the devices in the bus. > - * The power state can be non-D0 for some of the devices in the bus. > - * For these devices, the pci_reset_bus() will internally set > - * the power state to D0 without vfio driver involvement. > - * For the devices which have NoSoftRst-, the reset function can > - * cause the PCI config space reset without restoring the original > - * state (saved locally in 'vdev->pm_save'). > + * Some of the devices in the bus can be in the runtime suspended > + * state. Increment the usage count for all the devices in the dev_set > + * before reset and decrement the same after reset. > */ > - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) > - vfio_pci_set_power_state(cur, PCI_D0); > + if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set)) > + return; > > - ret = pci_reset_bus(pdev); > - if (ret) > - return false; > + if (!pci_reset_bus(pdev)) > + reset_done = true; > > list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { > - cur->needs_reset = false; > + if (reset_done) > + cur->needs_reset = false; > + > if (!disable_idle_d3) > - vfio_pci_set_power_state(cur, PCI_D3hot); > + pm_runtime_put(&cur->pdev->dev); > } > - return true; > } > > void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
On 5/5/2022 1:12 AM, Alex Williamson wrote: > On Mon, 25 Apr 2022 14:56:12 +0530 > Abhishek Sahu <abhsahu@nvidia.com> wrote: > >> Currently, there is very limited power management support >> available in the upstream vfio_pci_core based drivers. If there >> are no users of the device, then the PCI device will be moved into >> D3hot state by writing directly into PCI PM registers. This D3hot >> state help in saving power but we can achieve zero power consumption >> if we go into the D3cold state. The D3cold state cannot be possible >> with native PCI PM. It requires interaction with platform firmware >> which is system-specific. To go into low power states (including D3cold), >> the runtime PM framework can be used which internally interacts with PCI >> and platform firmware and puts the device into the lowest possible >> D-States. >> >> This patch registers vfio_pci_core based drivers with the >> runtime PM framework. >> >> 1. The PCI core framework takes care of most of the runtime PM >> related things. For enabling the runtime PM, the PCI driver needs to >> decrement the usage count and needs to provide 'struct dev_pm_ops' >> at least. The runtime suspend/resume callbacks are optional and needed >> only if we need to do any extra handling. Now there are multiple >> vfio_pci_core based drivers. Instead of assigning the >> 'struct dev_pm_ops' in individual parent driver, the vfio_pci_core >> itself assigns the 'struct dev_pm_ops'. There are other drivers where >> the 'struct dev_pm_ops' is being assigned inside core layer >> (For example, wlcore_probe() and some sound based driver, etc.). >> >> 2. This patch provides the stub implementation of 'struct dev_pm_ops'. >> The subsequent patch will provide the runtime suspend/resume >> callbacks. All the config state saving, and PCI power management >> related things will be done by PCI core framework itself inside its >> runtime suspend/resume callbacks (pci_pm_runtime_suspend() and >> pci_pm_runtime_resume()). >> >> 3. Inside pci_reset_bus(), all the devices in dev_set needs to be >> runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take >> care of the runtime resume and its error handling. >> >> 4. Inside vfio_pci_core_disable(), the device usage count always needs >> to be decremented which was incremented in vfio_pci_core_enable(). >> >> 5. Since the runtime PM framework will provide the same functionality, >> so directly writing into PCI PM config register can be replaced with >> the use of runtime PM routines. Also, the use of runtime PM can help >> us in more power saving. >> >> In the systems which do not support D3cold, >> >> With the existing implementation: >> >> // PCI device >> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state >> D3hot >> // upstream bridge >> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state >> D0 >> >> With runtime PM: >> >> // PCI device >> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state >> D3hot >> // upstream bridge >> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state >> D3hot >> >> So, with runtime PM, the upstream bridge or root port will also go >> into lower power state which is not possible with existing >> implementation. >> >> In the systems which support D3cold, >> >> // PCI device >> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state >> D3hot >> // upstream bridge >> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state >> D0 >> >> With runtime PM: >> >> // PCI device >> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state >> D3cold >> // upstream bridge >> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state >> D3cold >> >> So, with runtime PM, both the PCI device and upstream bridge will >> go into D3cold state. >> >> 6. If 'disable_idle_d3' module parameter is set, then also the runtime >> PM will be enabled, but in this case, the usage count should not be >> decremented. >> >> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this >> function return type can be changed to void. >> >> 8. Use the runtime PM API's in vfio_pci_core_sriov_configure(). >> For preventing any runtime usage mismatch, pci_num_vf() has been >> called explicitly during disable. >> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> >> --- >> drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++---------- >> 1 file changed, 114 insertions(+), 55 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> index 953ac33b2f5f..aee5e0cd6137 100644 >> --- a/drivers/vfio/pci/vfio_pci_core.c >> +++ b/drivers/vfio/pci/vfio_pci_core.c >> @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) >> } >> >> struct vfio_pci_group_info; >> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); >> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); >> static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >> struct vfio_pci_group_info *groups); >> >> @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat >> return ret; >> } >> >> +#ifdef CONFIG_PM >> +/* >> + * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, >> + * so use structure without any callbacks. >> + * >> + * The pci-driver core runtime PM routines always save the device state >> + * before going into suspended state. If the device is going into low power >> + * state with only with runtime PM ops, then no explicit handling is needed >> + * for the devices which have NoSoftRst-. >> + */ >> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; >> +#endif >> + >> int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) >> { >> struct pci_dev *pdev = vdev->pdev; >> @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) >> u16 cmd; >> u8 msix_pos; >> >> - vfio_pci_set_power_state(vdev, PCI_D0); >> + if (!disable_idle_d3) { >> + ret = pm_runtime_resume_and_get(&pdev->dev); >> + if (ret < 0) >> + return ret; >> + } >> >> /* Don't allow our initial saved state to include busmaster */ >> pci_clear_master(pdev); >> >> ret = pci_enable_device(pdev); >> if (ret) >> - return ret; >> + goto out_power; >> >> /* If reset fails because of the device lock, fail this path entirely */ >> ret = pci_try_reset_function(pdev); >> - if (ret == -EAGAIN) { >> - pci_disable_device(pdev); >> - return ret; >> - } >> + if (ret == -EAGAIN) >> + goto out_disable_device; >> >> vdev->reset_works = !ret; >> pci_save_state(pdev); >> @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) >> } >> >> ret = vfio_config_init(vdev); >> - if (ret) { >> - kfree(vdev->pci_saved_state); >> - vdev->pci_saved_state = NULL; >> - pci_disable_device(pdev); >> - return ret; >> - } >> + if (ret) >> + goto out_free_state; >> >> msix_pos = pdev->msix_cap; >> if (msix_pos) { >> @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) >> >> >> return 0; >> + >> +out_free_state: >> + kfree(vdev->pci_saved_state); >> + vdev->pci_saved_state = NULL; >> +out_disable_device: >> + pci_disable_device(pdev); >> +out_power: >> + if (!disable_idle_d3) >> + pm_runtime_put(&pdev->dev); >> + return ret; >> } >> EXPORT_SYMBOL_GPL(vfio_pci_core_enable); >> >> @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) >> out: >> pci_disable_device(pdev); >> >> - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3) >> - vfio_pci_set_power_state(vdev, PCI_D3hot); >> + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set); >> + >> + /* Put the pm-runtime usage counter acquired during enable */ >> + if (!disable_idle_d3) >> + pm_runtime_put(&pdev->dev); >> } >> EXPORT_SYMBOL_GPL(vfio_pci_core_disable); >> >> @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, >> >> vfio_pci_probe_power_state(vdev); >> >> - if (!disable_idle_d3) { >> - /* >> - * pci-core sets the device power state to an unknown value at >> - * bootup and after being removed from a driver. The only >> - * transition it allows from this unknown state is to D0, which >> - * typically happens when a driver calls pci_enable_device(). >> - * We're not ready to enable the device yet, but we do want to >> - * be able to get to D3. Therefore first do a D0 transition >> - * before going to D3. >> - */ >> - vfio_pci_set_power_state(vdev, PCI_D0); >> - vfio_pci_set_power_state(vdev, PCI_D3hot); >> - } >> + /* >> + * pci-core sets the device power state to an unknown value at >> + * bootup and after being removed from a driver. The only >> + * transition it allows from this unknown state is to D0, which >> + * typically happens when a driver calls pci_enable_device(). >> + * We're not ready to enable the device yet, but we do want to >> + * be able to get to D3. Therefore first do a D0 transition >> + * before enabling runtime PM. >> + */ >> + vfio_pci_set_power_state(vdev, PCI_D0); >> + >> +#if defined(CONFIG_PM) >> + dev->driver->pm = &vfio_pci_core_pm_ops, >> +#endif >> + >> + pm_runtime_allow(dev); >> + if (!disable_idle_d3) >> + pm_runtime_put(dev); >> >> ret = vfio_register_group_dev(&vdev->vdev); >> if (ret) >> @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, >> >> out_power: >> if (!disable_idle_d3) >> - vfio_pci_set_power_state(vdev, PCI_D0); >> + pm_runtime_get_noresume(dev); >> + >> + pm_runtime_forbid(dev); >> out_vf: >> vfio_pci_vf_uninit(vdev); >> out_drvdata: >> @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev) >> vfio_pci_vga_uninit(vdev); >> >> if (!disable_idle_d3) >> - vfio_pci_set_power_state(vdev, PCI_D0); >> + pm_runtime_get_noresume(dev); >> >> + pm_runtime_forbid(dev); >> dev_set_drvdata(dev, NULL); >> } >> EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device); >> @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) >> >> /* >> * The PF power state should always be higher than the VF power >> - * state. If PF is in the low power state, then change the >> - * power state to D0 first before enabling SR-IOV. >> + * state. If PF is in the runtime suspended state, then resume >> + * it first before enabling SR-IOV. >> */ >> - vfio_pci_set_power_state(vdev, PCI_D0); >> - ret = pci_enable_sriov(pdev, nr_virtfn); >> + ret = pm_runtime_resume_and_get(&pdev->dev); >> if (ret) >> goto out_del; >> + >> + ret = pci_enable_sriov(pdev, nr_virtfn); >> + if (ret) { >> + pm_runtime_put(&pdev->dev); >> + goto out_del; >> + } >> ret = nr_virtfn; >> goto out_put; >> } >> >> - pci_disable_sriov(pdev); >> + if (pci_num_vf(pdev)) { >> + pci_disable_sriov(pdev); >> + pm_runtime_put(&pdev->dev); >> + } >> >> out_del: >> mutex_lock(&vfio_pci_sriov_pfs_mutex); >> @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set) >> return pdev; >> } >> >> +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set) >> +{ >> + struct vfio_pci_core_device *cur_pm; >> + struct vfio_pci_core_device *cur; >> + int ret = 0; >> + >> + list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) { >> + ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev); >> + if (ret < 0) >> + break; >> + } >> + >> + if (!ret) >> + return 0; >> + >> + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { >> + if (cur == cur_pm) >> + break; >> + pm_runtime_put(&cur->pdev->dev); >> + } >> + >> + return ret; >> +} > > The above works, but maybe could be a little cleaner taking advantage > of list_for_each_entry_continue_reverse as: > > { > struct vfio_pci_core_device *cur; > int ret; > > list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { > ret = pm_runtime_resume_and_get(&cur->pdev->dev); > if (ret) > goto unwind; > } > > return 0; > > unwind: > list_for_each_entry_continue_reverse(cur, &dev_set->device_list, vdev.dev_set_list) > pm_runtime_put(&cur->pdev->dev); > > return ret; > } > > Thanks, > Alex > Thanks Alex. I will make this change. Regards, Abhishek >> + >> /* >> * We need to get memory_lock for each device, but devices can share mmap_lock, >> * therefore we need to zap and hold the vma_lock for each device, and only then >> @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set) >> * - At least one of the affected devices is marked dirty via >> * needs_reset (such as by lack of FLR support) >> * Then attempt to perform that bus or slot reset. >> - * Returns true if the dev_set was reset. >> */ >> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) >> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) >> { >> struct vfio_pci_core_device *cur; >> struct pci_dev *pdev; >> - int ret; >> + bool reset_done = false; >> >> if (!vfio_pci_dev_set_needs_reset(dev_set)) >> - return false; >> + return; >> >> pdev = vfio_pci_dev_set_resettable(dev_set); >> if (!pdev) >> - return false; >> + return; >> >> /* >> - * The pci_reset_bus() will reset all the devices in the bus. >> - * The power state can be non-D0 for some of the devices in the bus. >> - * For these devices, the pci_reset_bus() will internally set >> - * the power state to D0 without vfio driver involvement. >> - * For the devices which have NoSoftRst-, the reset function can >> - * cause the PCI config space reset without restoring the original >> - * state (saved locally in 'vdev->pm_save'). >> + * Some of the devices in the bus can be in the runtime suspended >> + * state. Increment the usage count for all the devices in the dev_set >> + * before reset and decrement the same after reset. >> */ >> - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) >> - vfio_pci_set_power_state(cur, PCI_D0); >> + if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set)) >> + return; >> >> - ret = pci_reset_bus(pdev); >> - if (ret) >> - return false; >> + if (!pci_reset_bus(pdev)) >> + reset_done = true; >> >> list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { >> - cur->needs_reset = false; >> + if (reset_done) >> + cur->needs_reset = false; >> + >> if (!disable_idle_d3) >> - vfio_pci_set_power_state(cur, PCI_D3hot); >> + pm_runtime_put(&cur->pdev->dev); >> } >> - return true; >> } >> >> void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga, >
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 953ac33b2f5f..aee5e0cd6137 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) } struct vfio_pci_group_info; -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, struct vfio_pci_group_info *groups); @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat return ret; } +#ifdef CONFIG_PM +/* + * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, + * so use structure without any callbacks. + * + * The pci-driver core runtime PM routines always save the device state + * before going into suspended state. If the device is going into low power + * state with only with runtime PM ops, then no explicit handling is needed + * for the devices which have NoSoftRst-. + */ +static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; +#endif + int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) { struct pci_dev *pdev = vdev->pdev; @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) u16 cmd; u8 msix_pos; - vfio_pci_set_power_state(vdev, PCI_D0); + if (!disable_idle_d3) { + ret = pm_runtime_resume_and_get(&pdev->dev); + if (ret < 0) + return ret; + } /* Don't allow our initial saved state to include busmaster */ pci_clear_master(pdev); ret = pci_enable_device(pdev); if (ret) - return ret; + goto out_power; /* If reset fails because of the device lock, fail this path entirely */ ret = pci_try_reset_function(pdev); - if (ret == -EAGAIN) { - pci_disable_device(pdev); - return ret; - } + if (ret == -EAGAIN) + goto out_disable_device; vdev->reset_works = !ret; pci_save_state(pdev); @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) } ret = vfio_config_init(vdev); - if (ret) { - kfree(vdev->pci_saved_state); - vdev->pci_saved_state = NULL; - pci_disable_device(pdev); - return ret; - } + if (ret) + goto out_free_state; msix_pos = pdev->msix_cap; if (msix_pos) { @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) return 0; + +out_free_state: + kfree(vdev->pci_saved_state); + vdev->pci_saved_state = NULL; +out_disable_device: + pci_disable_device(pdev); +out_power: + if (!disable_idle_d3) + pm_runtime_put(&pdev->dev); + return ret; } EXPORT_SYMBOL_GPL(vfio_pci_core_enable); @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) out: pci_disable_device(pdev); - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3) - vfio_pci_set_power_state(vdev, PCI_D3hot); + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set); + + /* Put the pm-runtime usage counter acquired during enable */ + if (!disable_idle_d3) + pm_runtime_put(&pdev->dev); } EXPORT_SYMBOL_GPL(vfio_pci_core_disable); @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, vfio_pci_probe_power_state(vdev); - if (!disable_idle_d3) { - /* - * pci-core sets the device power state to an unknown value at - * bootup and after being removed from a driver. The only - * transition it allows from this unknown state is to D0, which - * typically happens when a driver calls pci_enable_device(). - * We're not ready to enable the device yet, but we do want to - * be able to get to D3. Therefore first do a D0 transition - * before going to D3. - */ - vfio_pci_set_power_state(vdev, PCI_D0); - vfio_pci_set_power_state(vdev, PCI_D3hot); - } + /* + * pci-core sets the device power state to an unknown value at + * bootup and after being removed from a driver. The only + * transition it allows from this unknown state is to D0, which + * typically happens when a driver calls pci_enable_device(). + * We're not ready to enable the device yet, but we do want to + * be able to get to D3. Therefore first do a D0 transition + * before enabling runtime PM. + */ + vfio_pci_set_power_state(vdev, PCI_D0); + +#if defined(CONFIG_PM) + dev->driver->pm = &vfio_pci_core_pm_ops, +#endif + + pm_runtime_allow(dev); + if (!disable_idle_d3) + pm_runtime_put(dev); ret = vfio_register_group_dev(&vdev->vdev); if (ret) @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, out_power: if (!disable_idle_d3) - vfio_pci_set_power_state(vdev, PCI_D0); + pm_runtime_get_noresume(dev); + + pm_runtime_forbid(dev); out_vf: vfio_pci_vf_uninit(vdev); out_drvdata: @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev) vfio_pci_vga_uninit(vdev); if (!disable_idle_d3) - vfio_pci_set_power_state(vdev, PCI_D0); + pm_runtime_get_noresume(dev); + pm_runtime_forbid(dev); dev_set_drvdata(dev, NULL); } EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device); @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) /* * The PF power state should always be higher than the VF power - * state. If PF is in the low power state, then change the - * power state to D0 first before enabling SR-IOV. + * state. If PF is in the runtime suspended state, then resume + * it first before enabling SR-IOV. */ - vfio_pci_set_power_state(vdev, PCI_D0); - ret = pci_enable_sriov(pdev, nr_virtfn); + ret = pm_runtime_resume_and_get(&pdev->dev); if (ret) goto out_del; + + ret = pci_enable_sriov(pdev, nr_virtfn); + if (ret) { + pm_runtime_put(&pdev->dev); + goto out_del; + } ret = nr_virtfn; goto out_put; } - pci_disable_sriov(pdev); + if (pci_num_vf(pdev)) { + pci_disable_sriov(pdev); + pm_runtime_put(&pdev->dev); + } out_del: mutex_lock(&vfio_pci_sriov_pfs_mutex); @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set) return pdev; } +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set) +{ + struct vfio_pci_core_device *cur_pm; + struct vfio_pci_core_device *cur; + int ret = 0; + + list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) { + ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev); + if (ret < 0) + break; + } + + if (!ret) + return 0; + + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { + if (cur == cur_pm) + break; + pm_runtime_put(&cur->pdev->dev); + } + + return ret; +} + /* * We need to get memory_lock for each device, but devices can share mmap_lock, * therefore we need to zap and hold the vma_lock for each device, and only then @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set) * - At least one of the affected devices is marked dirty via * needs_reset (such as by lack of FLR support) * Then attempt to perform that bus or slot reset. - * Returns true if the dev_set was reset. */ -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) { struct vfio_pci_core_device *cur; struct pci_dev *pdev; - int ret; + bool reset_done = false; if (!vfio_pci_dev_set_needs_reset(dev_set)) - return false; + return; pdev = vfio_pci_dev_set_resettable(dev_set); if (!pdev) - return false; + return; /* - * The pci_reset_bus() will reset all the devices in the bus. - * The power state can be non-D0 for some of the devices in the bus. - * For these devices, the pci_reset_bus() will internally set - * the power state to D0 without vfio driver involvement. - * For the devices which have NoSoftRst-, the reset function can - * cause the PCI config space reset without restoring the original - * state (saved locally in 'vdev->pm_save'). + * Some of the devices in the bus can be in the runtime suspended + * state. Increment the usage count for all the devices in the dev_set + * before reset and decrement the same after reset. */ - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) - vfio_pci_set_power_state(cur, PCI_D0); + if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set)) + return; - ret = pci_reset_bus(pdev); - if (ret) - return false; + if (!pci_reset_bus(pdev)) + reset_done = true; list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { - cur->needs_reset = false; + if (reset_done) + cur->needs_reset = false; + if (!disable_idle_d3) - vfio_pci_set_power_state(cur, PCI_D3hot); + pm_runtime_put(&cur->pdev->dev); } - return true; } void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
Currently, there is very limited power management support available in the upstream vfio_pci_core based drivers. If there are no users of the device, then the PCI device will be moved into D3hot state by writing directly into PCI PM registers. This D3hot state help in saving power but we can achieve zero power consumption if we go into the D3cold state. The D3cold state cannot be possible with native PCI PM. It requires interaction with platform firmware which is system-specific. To go into low power states (including D3cold), the runtime PM framework can be used which internally interacts with PCI and platform firmware and puts the device into the lowest possible D-States. This patch registers vfio_pci_core based drivers with the runtime PM framework. 1. The PCI core framework takes care of most of the runtime PM related things. For enabling the runtime PM, the PCI driver needs to decrement the usage count and needs to provide 'struct dev_pm_ops' at least. The runtime suspend/resume callbacks are optional and needed only if we need to do any extra handling. Now there are multiple vfio_pci_core based drivers. Instead of assigning the 'struct dev_pm_ops' in individual parent driver, the vfio_pci_core itself assigns the 'struct dev_pm_ops'. There are other drivers where the 'struct dev_pm_ops' is being assigned inside core layer (For example, wlcore_probe() and some sound based driver, etc.). 2. This patch provides the stub implementation of 'struct dev_pm_ops'. The subsequent patch will provide the runtime suspend/resume callbacks. All the config state saving, and PCI power management related things will be done by PCI core framework itself inside its runtime suspend/resume callbacks (pci_pm_runtime_suspend() and pci_pm_runtime_resume()). 3. Inside pci_reset_bus(), all the devices in dev_set needs to be runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take care of the runtime resume and its error handling. 4. Inside vfio_pci_core_disable(), the device usage count always needs to be decremented which was incremented in vfio_pci_core_enable(). 5. Since the runtime PM framework will provide the same functionality, so directly writing into PCI PM config register can be replaced with the use of runtime PM routines. Also, the use of runtime PM can help us in more power saving. In the systems which do not support D3cold, With the existing implementation: // PCI device # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state D3hot // upstream bridge # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state D0 With runtime PM: // PCI device # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state D3hot // upstream bridge # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state D3hot So, with runtime PM, the upstream bridge or root port will also go into lower power state which is not possible with existing implementation. In the systems which support D3cold, // PCI device # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state D3hot // upstream bridge # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state D0 With runtime PM: // PCI device # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state D3cold // upstream bridge # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state D3cold So, with runtime PM, both the PCI device and upstream bridge will go into D3cold state. 6. If 'disable_idle_d3' module parameter is set, then also the runtime PM will be enabled, but in this case, the usage count should not be decremented. 7. vfio_pci_dev_set_try_reset() return value is unused now, so this function return type can be changed to void. 8. Use the runtime PM API's in vfio_pci_core_sriov_configure(). For preventing any runtime usage mismatch, pci_num_vf() has been called explicitly during disable. Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> --- drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++---------- 1 file changed, 114 insertions(+), 55 deletions(-)