[00/12] ACPI/NVDIMM: Runtime Firmware Activation

Message ID 159312902033.1850128.1712559453279208264.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)

Message

Dan Williams June 25, 2020, 11:50 p.m. UTC
Quoting the documentation:

    Some persistent memory devices run a firmware locally on the device /
    "DIMM" to perform tasks like media management, capacity provisioning,
    and health monitoring. The process of updating that firmware typically
    involves a reboot because it has implications for in-flight memory
    transactions. However, reboots are disruptive and at least the Intel
    persistent memory platform implementation, described by the Intel ACPI
    DSM specification [1], has added support for activating firmware at
    runtime.

    [1]: https://docs.pmem.io/persistent-memory/

The approach taken is to abstract the Intel platform specific mechanism
behind a libnvdimm-generic sysfs interface. The interface could support
runtime-firmware-activation on another architecture without need to
change userspace tooling.

The ACPI NFIT implementation involves a set of device-specific-methods
(DSMs) to 'arm' individual devices for activation and a bus-level
'trigger' method to execute the activation. Informational / enumeration
methods are also provided at the bus and device level.

One complicating aspect of the memory device firmware activation is that
the memory controller may need to be quiesced (no memory cycles) during
the activation. While the platform has mechanisms to support holding off
in-flight DMA during the activation, the device response to that delay
is potentially undefined. The platform may reject a runtime firmware
update if, for example, a PCIe device does not support its completion
timeout value being increased to meet the activation time. Outside of
device timeouts, the quiesce period may also violate application
timeouts.

Given the above device and application timeout considerations the
implementation defaults to hooking into the suspend path to trigger the
activation, i.e. that a suspend-resume cycle (at least up to the syscore
suspend point) is required. That default policy ensures that the system
is in a quiescent state before ceasing memory controller responses for
the activation. However, if desired, runtime activation without suspend
can be forced as an override.

The ndctl utility grows the following extensions / commands to drive
this mechanism:

1/ By default, the existing update-firmware command now 'arms' devices
   where the firmware image is staged.

    ndctl update-firmware all -f firmware_image.bin

2/ The existing ability to enumerate firmware-update capabilities now
   includes firmware activate capabilities at the 'bus' and 'dimm/device'
   level:

    ndctl list -BDF -b nfit_test.0
    [
      {
        "provider":"nfit_test.0",
        "dev":"ndbus2",
        "scrub_state":"idle",
        "firmware":{
          "activate_method":"suspend",
          "activate_state":"idle"
        },
        "dimms":[
          {
            "dev":"nmem1",
            "id":"cdab-0a-07e0-ffffffff",
            "handle":0,
            "phys_id":0,
            "security":"disabled",
            "firmware":{
              "current_version":0,
              "can_update":true
            }
          },
    ...

3/ When the system can support activation without quiesce, or when the
   suspend-resume requirement is going to be suppressed, the new
   activate-firmware command wraps that functionality:

    ndctl activate-firmware nfit_test.0 --force
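
Taken together, the three steps above amount to a stage / inspect /
activate sequence. The following is only a sketch of that sequence: the
helper name fw_activate_plan is hypothetical, and it prints the ndctl
invocations quoted above rather than running them, so it assumes nothing
about ndctl beyond what this cover letter shows.

```shell
#!/bin/sh
# Print (do not run) the ndctl commands for a stage / inspect / activate
# cycle. fw_activate_plan is a hypothetical helper for illustration; the
# ndctl command lines are the ones quoted in this cover letter.
fw_activate_plan() {
    image=$1        # firmware image file, e.g. firmware_image.bin
    bus=$2          # bus provider name, e.g. nfit_test.0
    force=${3:-}    # pass "force" to override the suspend-based default

    # 1/ stage the image; per the text above this also arms the devices
    echo "ndctl update-firmware all -f $image"
    # 2/ inspect activate_method / activate_state for the bus
    echo "ndctl list -BDF -b $bus"
    # 3/ only needed when activating without a suspend-resume cycle
    if [ "$force" = "force" ]; then
        echo "ndctl activate-firmware $bus --force"
    fi
}

fw_activate_plan firmware_image.bin nfit_test.0 force
```

Without the third argument only the first two commands are printed,
which corresponds to the default policy of waiting for a suspend cycle.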

One major open question for review is how users can trigger
firmware-activation via suspend without doing a full trip through the
BIOS. The activation currently requires CONFIG_PM_DEBUG to enable that
flow. This seems an awkward dependency for something that is expected to
be a production capability.
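
To make the dependency concrete: /sys/power/pm_test is only present on
kernels built with PM debugging enabled, so userspace can probe for it
before attempting the suspend-based flow. A minimal sketch follows. The
helper name pm_test_supports and the SYSFS_POWER override are
hypothetical, and the use of the "core" test level for stopping at the
syscore suspend point is an assumption, not something this series
states.

```shell
#!/bin/sh
# Return success if the pm_test facility exists and offers the given level.
# SYSFS_POWER (hypothetical override) can point at a scratch directory so
# the function can be exercised without touching the real /sys/power.
pm_test_supports() {
    level=$1
    dir=${SYSFS_POWER:-/sys/power}

    # The file is absent when the kernel lacks PM debug support.
    [ -r "$dir/pm_test" ] || return 1

    # pm_test lists every level on one line, with the active one in
    # [brackets]; strip the brackets and look for an exact match.
    tr -d '[]' < "$dir/pm_test" | tr ' ' '\n' | grep -qx "$level"
}

if pm_test_supports core; then
    echo "suspend test mode is available"
else
    echo "pm_test missing or 'core' level not offered"
fi
```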

---

Dan Williams (12):
      libnvdimm: Validate command family indices
      ACPI: NFIT: Move bus_dsm_mask out of generic nvdimm_bus_descriptor
      ACPI: NFIT: Define runtime firmware activation commands
      tools/testing/nvdimm: Cleanup dimm index passing
      tools/testing/nvdimm: Add command debug messages
      tools/testing/nvdimm: Prepare nfit_ctl_test() for ND_CMD_CALL emulation
      tools/testing/nvdimm: Emulate firmware activation commands
      driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
      libnvdimm: Convert to DEVICE_ATTR_ADMIN_RO()
      libnvdimm: Add runtime firmware activation sysfs interface
      PM, libnvdimm: Add syscore_quiesced() callback for firmware activation
      ACPI: NFIT: Add runtime firmware activate support


 Documentation/ABI/testing/sysfs-bus-nfit           |   35 ++
 Documentation/ABI/testing/sysfs-bus-nvdimm         |    2 
 .../driver-api/nvdimm/firmware-activate.rst        |   74 +++
 drivers/acpi/nfit/core.c                           |  146 +++++--
 drivers/acpi/nfit/intel.c                          |  426 ++++++++++++++++++++
 drivers/acpi/nfit/intel.h                          |   61 +++
 drivers/acpi/nfit/nfit.h                           |   39 ++
 drivers/base/syscore.c                             |   18 +
 drivers/nvdimm/bus.c                               |   46 ++
 drivers/nvdimm/core.c                              |  103 +++++
 drivers/nvdimm/dimm_devs.c                         |   99 +++++
 drivers/nvdimm/namespace_devs.c                    |    2 
 drivers/nvdimm/nd-core.h                           |    1 
 drivers/nvdimm/pfn_devs.c                          |    2 
 drivers/nvdimm/region_devs.c                       |    2 
 include/linux/device.h                             |    4 
 include/linux/libnvdimm.h                          |   53 ++
 include/linux/syscore_ops.h                        |    2 
 include/linux/sysfs.h                              |    7 
 include/uapi/linux/ndctl.h                         |    5 
 kernel/power/suspend.c                             |    2 
 tools/testing/nvdimm/test/nfit.c                   |  367 ++++++++++++++---
 22 files changed, 1382 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-nvdimm
 create mode 100644 Documentation/driver-api/nvdimm/firmware-activate.rst

base-commit: 48778464bb7d346b47157d21ffde2af6b2d39110

Comments

Rafael J. Wysocki June 26, 2020, 2:22 p.m. UTC | #1
On Fri, Jun 26, 2020 at 2:06 AM Dan Williams <dan.j.williams@intel.com> wrote:
> Given the above device and application timeout considerations the
> implementation defaults to hooking into the suspend path to trigger the
> activation, i.e. that a suspend-resume cycle (at least up to the syscore
> suspend point) is required.

Well, that doesn't work if the suspend method for the system is set to
suspend-to-idle (for example, via /sys/power/mem_sleep), because the
syscore callbacks are not invoked in that case.
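
The suspend-to-idle case can be detected from userspace, since
/sys/power/mem_sleep marks the selected variant with brackets (for
example "s2idle [deep]"). A minimal sketch, where the helper name
selected_mem_sleep is hypothetical:

```shell
#!/bin/sh
# Print the suspend variant currently selected in mem_sleep,
# e.g. "deep" for a file containing "s2idle [deep]".
# The optional argument overrides the path, which helps with testing.
selected_mem_sleep() {
    file=${1:-/sys/power/mem_sleep}
    [ -r "$file" ] || return 1
    # Keep only the text between the brackets.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}

if [ "$(selected_mem_sleep)" = "s2idle" ]; then
    echo "s2idle selected: syscore callbacks would not be invoked"
fi
```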

Also you probably don't need the device power state toggling that
happens during regular suspend/resume (you may not want it even for
some devices).

The hibernation freeze/thaw may be a better match and there is some
test support in there already that may be kind of co-opted for your
use case.

Cheers!

Dan Williams June 26, 2020, 6:43 p.m. UTC | #2
On Fri, Jun 26, 2020 at 7:22 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> Well, that doesn't work if the suspend method for the system is set to
> suspend-to-idle (for example, via /sys/power/mem_sleep), because the
> syscore callbacks are not invoked in that case.
>
> Also you probably don't need the device power state toggling that
> happens during regular suspend/resume (you may not want it even for
> some devices).
>
> The hibernation freeze/thaw may be a better match and there is some
> test support in there already that may be kind of co-opted for your
> use case.

Hmm, yes I guess freeze should be sufficient to quiesce most
device-DMA in the general case as applications will stop sending
requests. I do expect some RDMA devices will happily keep on
transmitting, but that likely will need explicit mitigation. It also
appears the suspend callback for at least one RDMA device,
mlx5_suspend(), is rather violent as it appears to fully tear down the
device context, not just suspend operations.

To be clear, what debug interface were you thinking I could glom onto
to just trigger firmware-activate at the end of the freeze phase?

Rafael J. Wysocki June 28, 2020, 5:22 p.m. UTC | #3
On Fri, Jun 26, 2020 at 8:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> Hmm, yes I guess freeze should be sufficient to quiesce most
> device-DMA in the general case as applications will stop sending
> requests.

It is expected to be sufficient to quiesce all of them.

If that is not the case, the integrity of the hibernation image cannot
be guaranteed on the system in question.

> I do expect some RDMA devices will happily keep on
> transmitting, but that likely will need explicit mitigation. It also
> appears the suspend callback for at least one RDMA device,
> mlx5_suspend(), is rather violent as it appears to fully tear down the
> device context, not just suspend operations.
>
> To be clear, what debug interface were you thinking I could glom onto
> to just trigger firmware-activate at the end of the freeze phase?

Functionally, the same as for suspend, but using the hibernation
interface, so "echo platform > /sys/power/pm_test" followed by "echo
disk > /sys/power/state".
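
Spelled out as a script, the two writes above could look like the
following sketch. The helper name hibernate_test_cycle and the DRY_RUN
and SYSFS_POWER variables are hypothetical. Dry-run is the default so
that nothing is written to /sys/power unless explicitly requested, and
resetting pm_test to "none" afterwards is an added precaution rather
than part of the suggestion above.

```shell
#!/bin/sh
# Run a hibernation test cycle that stops at the given pm_test level.
# With DRY_RUN=1 (the default) the writes are printed, not performed.
hibernate_test_cycle() {
    level=${1:-platform}
    dir=${SYSFS_POWER:-/sys/power}

    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "echo $level > $dir/pm_test"
        echo "echo disk > $dir/state"
        return 0
    fi

    # Select the test level, then start the (test-mode) hibernation.
    echo "$level" > "$dir/pm_test" || return 1
    echo disk > "$dir/state"
    status=$?

    # Put pm_test back so later transitions behave normally.
    echo none > "$dir/pm_test"
    return "$status"
}

hibernate_test_cycle platform
```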

But it might be cleaner to introduce a special "hibernation mode", i.e.
one more item in /sys/power/disk, that will trigger what you need
(in analogy with "test_resume").

Dan Williams June 29, 2020, 11:37 p.m. UTC | #4
On Sun, Jun 28, 2020 at 10:23 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> It is expected to be sufficient to quiesce all of them.
>
> If that is not the case, the integrity of the hibernation image cannot
> be guaranteed on the system in question.
>

Ah, indeed, I was overlooking that property.

> > I do expect some RDMA devices will happily keep on
> > transmitting, but that likely will need explicit mitigation. It also
> > appears the suspend callback for at least one RDMA device,
> > mlx5_suspend(), is rather violent as it appears to fully tear down the
> > device context, not just suspend operations.
> >
> > To be clear, what debug interface were you thinking I could glom onto
> > to just trigger firmware-activate at the end of the freeze phase?
>
> Functionally, the same as for suspend, but using the hibernation
> interface, so "echo platform > /sys/power/pm_test" followed by "echo
> disk > /sys/power/state".
>
> But it might be cleaner to introduce a special "hibernation mode", i.e.
> one more item in /sys/power/disk, that will trigger what you need
> (in analogy with "test_resume").

I'll move the trigger to be after process freeze, but I'll keep it
tied to suspend-debug vs hibernate-debug. It appears the hibernate
debug path still goes through the exercise of allocating memory for
the hibernation image which is unnecessary if the goal is just to
'freeze', 'activate', and 'thaw'.

Rafael J. Wysocki June 30, 2020, 10:55 a.m. UTC | #5
On Tue, Jun 30, 2020 at 1:37 AM Dan Williams <dan.j.williams@intel.com> wrote:
> I'll move the trigger to be after process freeze, but I'll keep it
> tied to suspend-debug vs hibernate-debug. It appears the hibernate
> debug path still goes through the exercise of allocating memory for
> the hibernation image which is unnecessary if the goal is just to
> 'freeze', 'activate', and 'thaw'.

But you need the ->freeze and ->thaw callbacks to run, which does not
happen at the process freeze stage.

If you add a new hibernation mode dedicated to the NVDIMM firmware
update, though, you can instrument the code to skip the memory
allocation if this mode is selected.