Message ID | 6-v1-9912f1a11620+2a-fwctl_jgg@nvidia.com (mailing list archive) |
---|---|
State | Superseded |
Series | Introduce fwctl subsystem |
On 6/3/24 8:53 AM, Jason Gunthorpe wrote:
> Document the purpose and rules for the fwctl subsystem.
>
> Link in kdocs to the doc tree.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
>  Documentation/userspace-api/index.rst |   1 +
>  2 files changed, 270 insertions(+)
>  create mode 100644 Documentation/userspace-api/fwctl.rst
>
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..630e75a91838f0
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +fwctl subsystem
> +===============
> +
> +:Author: Jason Gunthorpe
> +
> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely
> +software defined pieces of hardware. The evolution of this approach is largely a

software-defined

> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).
> +
> +The FW layer in devices has grown to incredible sizes and devices frequently
> +integrate clusters of fast processors to run it. For example, mlx5 devices have
> +over 30MB of FW code, and big configurations operate with over 1GB of FW managed
> +runtime state.
> +
> +The availability of such a flexible layer has created quite a variety in the
> +industry where single pieces of silicon are now configurable software defined

software-defined

> +devices and can operate in substantially different ways depending on the need.
> +Further we often see cases where specific sites wish to operate devices in ways

Further, like in the next paragraph.

> +that are highly specialized and require applications that have been tailored to
> +their unique configuration.
> +
> +Further, devices have become multi-functional and integrated to the point they
> +no longer fit neatly into the kernel's division of subsystems. Modern
> +multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
> +subsystems while sharing the underlying hardware using the auxiliary device
> +system.
> +
> +All together this creates a challenge for the operating system, where devices
> +have an expansive FW environment that needs robust device-specific debugging
> +support, and FW driven functionality that is not well suited to “generic”

FW-driven

> +interfaces. fwctl seeks to allow access to the full device functionality from
> +user space in the areas of debuggability, management, and first-boot/nth-boot
> +provisioning.
> +
> +fwctl is aimed at the common device design pattern where the OS and FW
> +communicate via an RPC message layer constructed with a queue or mailbox scheme.
> +In this case the driver will typically have some layer to deliver RPC messages
> +and collect RPC responses from device FW. The in-kernel subsystem drivers that
> +operate the device for its primary purposes will use these RPCs to build their
> +drivers, but devices also usually have a set of ancillary RPCs that don't really
> +fit into any specific subsystem. For example, a HW RAID controller is primarily
> +operated by the block layer but also comes with a set of RPCs to administer the
> +construction of drives within the HW RAID.
> +
> +In the past when devices were more single function individual subsystems would

function,

> +grow different approaches to solving some of these common problems, for instance

problems. For instance

> +monitoring device health, manipulating its FLASH, debugging the FW,
> +provisioning, all have various unique interfaces across the kernel.
> +
> +fwctl's purpose is to define a common set of limited rules, described below,
> +that allow user space to securely construct and execute RPCs inside device FW.
> +The rules serve as an agreement between the operating system and FW on how to
> +correctly design the RPC interface. As a uAPI the subsystem provides a thin
> +layer of discovery and a generic uAPI to deliver the RPCs and collect the
> +response. It supports a system of user space libraries and tools which will
> +use this interface to control the device using the device native protocols.
> +
> +Scope of Action
> +---------------
> +
> +fwctl drivers are strictly restricted to being a way to operate the device FW.
> +It is not an avenue to access random kernel internals, or other operating system
> +SW states.
> +
> +fwctl instances must operate on a well-defined device function, and the device
> +should have a well-defined security model for what scope within the physical
> +device the function is permitted to access. For instance, the most complex PCIe
> +device today may broadly have several function level scopes:

function-level

> +
> + 1. A privileged function with full access to the on-device global state and
> +    configuration
> +
> + 2. Multiple hypervisor functions with control over itself and child functions
> +    used with VMs
> +
> + 3. Multiple VM functions tightly scoped within the VM
> +
> +The device may create a logical parent/child relationship between these scopes,

scopes;
or end that line with a period and begin the next one with "For".

> +for instance a child VM's FW may be within the scope of the hypervisor FW. It is
> +quite common in the VFIO world that the hypervisor environment has a complex
> +provisioning/profiling/configuration responsibility for the function VFIO
> +assigns to the VM.
> +
> +Further, within the function, devices often have RPC commands that fall within
> +some general scopes of action:
> +
> + 1. Access to function & child configuration, flash, etc that becomes live at a

etc.
Use FLASH as above or change above FLASH to "flash".

> +    function reset.
> +
> + 2. Access to function & child runtime configuration that kernel drivers can
> +    discover at runtime.
> +
> + 3. Read only access to function debug information that may report on FW objects

Read-only

> +    in the function & child, including FW objects owned by other kernel
> +    subsystems.
> +
> + 4. Write access to function & child debug information strictly compatible with
> +    the principles of kernel lockdown and kernel integrity protection. Triggers
> +    a kernel Taint.
> +
> + 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
> +
> +Userspace will provide a scope label on each RPC and the kernel must enforce the
> +above CAP's and taints based on that scope. A combination of kernel and FW can
> +enforce that RPCs are placed in the correct scope by userspace.
> +
> +Denied behavior
> +---------------
> +
> +There are many things this interface must not allow user space to do (without a
> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> +examples:
> +
> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
> +    otherwise compromise device or system security and integrity.
> +
> + 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
> +    objects owned by kernel drivers.
> +
> + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> +    driver can react to the device configuration at function reset/driver load
> +    time, but otherwise should not be coupled to fwctl.
> +
> + 4. Operate the HW in a way that overlaps with the core purpose of another
> +    primary kernel subsystem, such as read/write to LBAs, send/receive of
> +    network packets, or operate an accelerator's data plane.
> +
> +fwctl is not a replacement for device direct access subsystems like uacce or
> +VFIO.
> +
> +fwctl User API
> +==============
> +
> +.. kernel-doc:: include/uapi/fwctl/fwctl.h
> +.. kernel-doc:: include/uapi/fwctl/mlx5.h
> +
> +sysfs Class
> +-----------
> +
> +fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
> +(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
> +operates the iotcl uAPI described above.
> +
> +fwctl devices can be related to driver components in other subsystems through
> +sysfs::
> +
> +    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
> +    ibp0s10f0
> +
> +    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
> +    fwctl0/
> +
> +    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
> +    dev device power subsystem uevent
> +
> +User space Community
> +--------------------
> +
> +Drawing inspiration from nvme-cli, participating in the kernel side must come
> +with a user space in a common TBD git tree, at a minimum to usefully operate the
> +kernel driver. Providing such an implementation is a pre-condition to merging a
> +kernel driver.
> +
> +The goal is to build user space community around some of the shared problems
> +we all have, and ideally develop some common user space programs with some
> +starting themes of:
> +
> + - Device in-field debugging
> +
> + - HW provisioning
> +
> + - VFIO child device profiling before VM boot
> +
> + - Confidential Compute topics (attestation, secure provisioning)
> +
> +That stretches across all subsystems in the kernel. fwupd is a great example of

that stretch across

> +how an excellent user space experience can emerge out of kernel-side diversity.
> +
> +fwctl Kernel API
> +================
> +
> +.. kernel-doc:: drivers/fwctl/main.c
> +   :export:
> +.. kernel-doc:: include/linux/fwctl.h
> +
> +fwctl Driver design
> +-------------------
> +
> +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> +device possibly using the auxiliary_device mechanism. In that case several
> +subsystems are going to be sharing the same device and FW interface layer so the
> +device design must already provide for isolation and co-operation between kernel

cooperation

> +subsystems. fwctl should fit into that same model.
> +
> +Part of the driver should include a description of how its scope restrictions
> +and security model work. The driver and FW together must ensure that RPCs
> +provided by user space are mapped to the appropriate scope. If the validation is
> +done in the driver then the validation can read a 'command effects' report from
> +the device, or hardwire the enforcement. If the validation is done in the FW,
> +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> +
> +The driver and FW must co-operate to ensure that either fwctl cannot allocate

cooperate

> +any FW resources, or any resources it does allocate are freed on FD closure. A
> +driver primarily constructed around FW RPCs may find that its core PCI function
> +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> +subsystems.
> +
> +Each device type must represent a stable FW ABI, such that the userspace
> +components have the same general stability we expect from the kernel. FW upgrade
> +should not break the userspace tools.
> +
> +Security Response
> +=================
> +
> +The kernel remains the gatekeeper for this interface. If violations of the
> +scopes, security or isolation principles are found, we have options to let
> +devices fix them with a FW update, push a kernel patch to parse and block RPC
> +commands or push a kernel patch to block entire firmware versions, or devices.

                                                    no comma needed ^

> +
> +While the kernel can always directly parse and restrict RPCs, it is expected
> +that the existing kernel pattern of allowing drivers to delegate validation to
> +FW to be a useful design.

(and one that can be abused...)

> +
> +Existing Similar Examples
> +=========================
> +
> +The approach described in this document is not a new idea. Direct, or near
> +direct device access has been offered by the kernel in different areas for
> +decades. With more devices wanting to follow this design pattern it is becoming
> +clear that it is not entirely well understood and, more importantly, the
> +security considerations are not well defined or agreed upon.
> +
> +Some examples:
> +
> + - HW RAID controllers. This includes RPCs to do things like compose drives into
> +   a RAID volume, configure RAID parameters, monitor the HW and more.
> +
> + - Baseboard managers. RPCs for configuring settings in the device and more
> +
> + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> +   functions that different products have defined, but more exists.
> +
> + - CXL also has a NVMe like vendor command system.

NVMe-like

> +
> + - DRM allows user space drivers to send commands to the device via kernel
> +   mediation
> +
> + - RDMA allows user space drivers to directly push commands to the device
> +   without kernel involvement
> +
> + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> +
> +The first 4 would be examples of areas that fwctl intends to cover.

I would s/would be/are/ fwiw.

> +
> +Some key lessons learned from these past efforts are the importance of having a
> +common user space project to use as a pre-condition for obtaining a kernel
> +driver. Developing good community around useful software in user space is key to
> +getting companies to fund participation to enable their products.
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 5926115ec0ed86..9685942fc8a21f 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -43,6 +43,7 @@ Devices and I/O
>
>     accelerators/ocxl
>     dma-buf-alloc-exchange
> +   fwctl
>     gpio/index
>     iommu
>     iommufd
On Tue, Jun 04, 2024 at 07:31:10PM -0700, Randy Dunlap wrote:
> > +Modern devices contain extensive amounts of FW, and in many cases, are largely
> > +software defined pieces of hardware. The evolution of this approach is largely a
>
> software-defined

Thanks a lot Randy, I picked up all your notes.

> > +While the kernel can always directly parse and restrict RPCs, it is expected
> > +that the existing kernel pattern of allowing drivers to delegate validation to
> > +FW to be a useful design.
>
> (and one that can be abused...)

I would really like to write a paragraph about this "abuse", Dan has
some good thoughts on this as well. Did you have a specific "abuse"
in your mind?

Thanks,
Jason
On 6/5/24 9:03 AM, Jason Gunthorpe wrote:
> On Tue, Jun 04, 2024 at 07:31:10PM -0700, Randy Dunlap wrote:
>
>>> +Modern devices contain extensive amounts of FW, and in many cases, are largely
>>> +software defined pieces of hardware. The evolution of this approach is largely a
>>
>> software-defined
>
> Thanks a lot Randy, I picked up all your notes.
>
>>> +While the kernel can always directly parse and restrict RPCs, it is expected
>>> +that the existing kernel pattern of allowing drivers to delegate validation to
>>> +FW to be a useful design.
>>
>> (and one that can be abused...)
>
> I would really like to write a paragraph about this "abuse", Dan has
> some good thoughts on this as well. Did you have a specific "abuse"
> in your mind?

No, I don't. It just seems very open (but ioctls are just as open).
diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
new file mode 100644
index 00000000000000..630e75a91838f0
--- /dev/null
+++ b/Documentation/userspace-api/fwctl.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+fwctl subsystem
+===============
+
+:Author: Jason Gunthorpe
+
+Overview
+========
+
+Modern devices contain extensive amounts of FW, and in many cases, are largely
+software defined pieces of hardware. The evolution of this approach is largely a
+reaction to Moore's Law where a chip tape out is now highly expensive, and the
+chip design is extremely large. Replacing fixed HW logic with a flexible and
+tightly coupled FW/HW combination is an effective risk mitigation against chip
+respin. Problems in the HW design can be counteracted in device FW. This is
+especially true for devices which present a stable and backwards compatible
+interface to the operating system driver (such as NVMe).
+
+The FW layer in devices has grown to incredible sizes and devices frequently
+integrate clusters of fast processors to run it. For example, mlx5 devices have
+over 30MB of FW code, and big configurations operate with over 1GB of FW managed
+runtime state.
+
+The availability of such a flexible layer has created quite a variety in the
+industry where single pieces of silicon are now configurable software defined
+devices and can operate in substantially different ways depending on the need.
+Further we often see cases where specific sites wish to operate devices in ways
+that are highly specialized and require applications that have been tailored to
+their unique configuration.
+
+Further, devices have become multi-functional and integrated to the point they
+no longer fit neatly into the kernel's division of subsystems. Modern
+multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
+subsystems while sharing the underlying hardware using the auxiliary device
+system.
+
+All together this creates a challenge for the operating system, where devices
+have an expansive FW environment that needs robust device-specific debugging
+support, and FW driven functionality that is not well suited to “generic”
+interfaces. fwctl seeks to allow access to the full device functionality from
+user space in the areas of debuggability, management, and first-boot/nth-boot
+provisioning.
+
+fwctl is aimed at the common device design pattern where the OS and FW
+communicate via an RPC message layer constructed with a queue or mailbox scheme.
+In this case the driver will typically have some layer to deliver RPC messages
+and collect RPC responses from device FW. The in-kernel subsystem drivers that
+operate the device for its primary purposes will use these RPCs to build their
+drivers, but devices also usually have a set of ancillary RPCs that don't really
+fit into any specific subsystem. For example, a HW RAID controller is primarily
+operated by the block layer but also comes with a set of RPCs to administer the
+construction of drives within the HW RAID.
+
+In the past when devices were more single function individual subsystems would
+grow different approaches to solving some of these common problems, for instance
+monitoring device health, manipulating its FLASH, debugging the FW,
+provisioning, all have various unique interfaces across the kernel.
+
+fwctl's purpose is to define a common set of limited rules, described below,
+that allow user space to securely construct and execute RPCs inside device FW.
+The rules serve as an agreement between the operating system and FW on how to
+correctly design the RPC interface. As a uAPI the subsystem provides a thin
+layer of discovery and a generic uAPI to deliver the RPCs and collect the
+response. It supports a system of user space libraries and tools which will
+use this interface to control the device using the device native protocols.
+
+Scope of Action
+---------------
+
+fwctl drivers are strictly restricted to being a way to operate the device FW.
+It is not an avenue to access random kernel internals, or other operating system
+SW states.
+
+fwctl instances must operate on a well-defined device function, and the device
+should have a well-defined security model for what scope within the physical
+device the function is permitted to access. For instance, the most complex PCIe
+device today may broadly have several function level scopes:
+
+ 1. A privileged function with full access to the on-device global state and
+    configuration
+
+ 2. Multiple hypervisor functions with control over itself and child functions
+    used with VMs
+
+ 3. Multiple VM functions tightly scoped within the VM
+
+The device may create a logical parent/child relationship between these scopes,
+for instance a child VM's FW may be within the scope of the hypervisor FW. It is
+quite common in the VFIO world that the hypervisor environment has a complex
+provisioning/profiling/configuration responsibility for the function VFIO
+assigns to the VM.
+
+Further, within the function, devices often have RPC commands that fall within
+some general scopes of action:
+
+ 1. Access to function & child configuration, flash, etc that becomes live at a
+    function reset.
+
+ 2. Access to function & child runtime configuration that kernel drivers can
+    discover at runtime.
+
+ 3. Read only access to function debug information that may report on FW objects
+    in the function & child, including FW objects owned by other kernel
+    subsystems.
+
+ 4. Write access to function & child debug information strictly compatible with
+    the principles of kernel lockdown and kernel integrity protection. Triggers
+    a kernel Taint.
+
+ 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
+
+Userspace will provide a scope label on each RPC and the kernel must enforce the
+above CAP's and taints based on that scope. A combination of kernel and FW can
+enforce that RPCs are placed in the correct scope by userspace.
+
+Denied behavior
+---------------
+
+There are many things this interface must not allow user space to do (without a
+Taint or CAP), broadly derived from the principles of kernel lockdown. Some
+examples:
+
+ 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
+    otherwise compromise device or system security and integrity.
+
+ 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
+    objects owned by kernel drivers.
+
+ 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
+    driver can react to the device configuration at function reset/driver load
+    time, but otherwise should not be coupled to fwctl.
+
+ 4. Operate the HW in a way that overlaps with the core purpose of another
+    primary kernel subsystem, such as read/write to LBAs, send/receive of
+    network packets, or operate an accelerator's data plane.
+
+fwctl is not a replacement for device direct access subsystems like uacce or
+VFIO.
+
+fwctl User API
+==============
+
+.. kernel-doc:: include/uapi/fwctl/fwctl.h
+.. kernel-doc:: include/uapi/fwctl/mlx5.h
+
+sysfs Class
+-----------
+
+fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
+(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
+operates the iotcl uAPI described above.
+
+fwctl devices can be related to driver components in other subsystems through
+sysfs::
+
+    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
+    ibp0s10f0
+
+    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
+    fwctl0/
+
+    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
+    dev device power subsystem uevent
+
+User space Community
+--------------------
+
+Drawing inspiration from nvme-cli, participating in the kernel side must come
+with a user space in a common TBD git tree, at a minimum to usefully operate the
+kernel driver. Providing such an implementation is a pre-condition to merging a
+kernel driver.
+
+The goal is to build user space community around some of the shared problems
+we all have, and ideally develop some common user space programs with some
+starting themes of:
+
+ - Device in-field debugging
+
+ - HW provisioning
+
+ - VFIO child device profiling before VM boot
+
+ - Confidential Compute topics (attestation, secure provisioning)
+
+That stretches across all subsystems in the kernel. fwupd is a great example of
+how an excellent user space experience can emerge out of kernel-side diversity.
+
+fwctl Kernel API
+================
+
+.. kernel-doc:: drivers/fwctl/main.c
+   :export:
+.. kernel-doc:: include/linux/fwctl.h
+
+fwctl Driver design
+-------------------
+
+In many cases a fwctl driver is going to be part of a larger cross-subsystem
+device possibly using the auxiliary_device mechanism. In that case several
+subsystems are going to be sharing the same device and FW interface layer so the
+device design must already provide for isolation and co-operation between kernel
+subsystems. fwctl should fit into that same model.
+
+Part of the driver should include a description of how its scope restrictions
+and security model work. The driver and FW together must ensure that RPCs
+provided by user space are mapped to the appropriate scope. If the validation is
+done in the driver then the validation can read a 'command effects' report from
+the device, or hardwire the enforcement. If the validation is done in the FW,
+then the driver should pass the fwctl_rpc_scope to the FW along with the command.
+
+The driver and FW must co-operate to ensure that either fwctl cannot allocate
+any FW resources, or any resources it does allocate are freed on FD closure. A
+driver primarily constructed around FW RPCs may find that its core PCI function
+and RPC layer belongs under fwctl with auxiliary devices connecting to other
+subsystems.
+
+Each device type must represent a stable FW ABI, such that the userspace
+components have the same general stability we expect from the kernel. FW upgrade
+should not break the userspace tools.
+
+Security Response
+=================
+
+The kernel remains the gatekeeper for this interface. If violations of the
+scopes, security or isolation principles are found, we have options to let
+devices fix them with a FW update, push a kernel patch to parse and block RPC
+commands or push a kernel patch to block entire firmware versions, or devices.
+
+While the kernel can always directly parse and restrict RPCs, it is expected
+that the existing kernel pattern of allowing drivers to delegate validation to
+FW to be a useful design.
+
+Existing Similar Examples
+=========================
+
+The approach described in this document is not a new idea. Direct, or near
+direct device access has been offered by the kernel in different areas for
+decades. With more devices wanting to follow this design pattern it is becoming
+clear that it is not entirely well understood and, more importantly, the
+security considerations are not well defined or agreed upon.
+
+Some examples:
+
+ - HW RAID controllers. This includes RPCs to do things like compose drives into
+   a RAID volume, configure RAID parameters, monitor the HW and more.
+
+ - Baseboard managers. RPCs for configuring settings in the device and more
+
+ - NVMe vendor command capsules. nvme-cli provides access to some monitoring
+   functions that different products have defined, but more exists.
+
+ - CXL also has a NVMe like vendor command system.
+
+ - DRM allows user space drivers to send commands to the device via kernel
+   mediation
+
+ - RDMA allows user space drivers to directly push commands to the device
+   without kernel involvement
+
+ - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
+
+The first 4 would be examples of areas that fwctl intends to cover.
+
+Some key lessons learned from these past efforts are the importance of having a
+common user space project to use as a pre-condition for obtaining a kernel
+driver. Developing good community around useful software in user space is key to
+getting companies to fund participation to enable their products.
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 5926115ec0ed86..9685942fc8a21f 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -43,6 +43,7 @@ Devices and I/O
 
    accelerators/ocxl
    dma-buf-alloc-exchange
+   fwctl
    gpio/index
    iommu
    iommufd
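As a reading aid for the "Scope of Action" list in the patch, here is a minimal sketch of the taint and capability rules it attaches to the five scopes of action. It is an illustration only, not the kernel implementation: the `Scope` names, the `Decision` type and `check_rpc()` are hypothetical, and the choice that a rejected request does not taint is an assumption the patch does not spell out.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """Hypothetical labels for the five scopes of action (not kernel identifiers)."""
    CONFIGURATION = 1          # becomes live at a function reset
    RUNTIME_CONFIGURATION = 2  # discoverable by kernel drivers at runtime
    DEBUG_READ_ONLY = 3        # read-only debug information
    DEBUG_WRITE = 4            # lockdown-compatible writes, taints the kernel
    DEBUG_FULL = 5             # taints the kernel and requires CAP_SYS_RAWIO


@dataclass(frozen=True)
class Decision:
    allowed: bool
    taints_kernel: bool


def check_rpc(scope: Scope, has_cap_sys_rawio: bool) -> Decision:
    """Map a scope label to the taint/capability rules listed in the patch."""
    if scope == Scope.DEBUG_FULL:
        # Assumption: a request rejected for lacking the capability never
        # runs, so it does not taint the kernel either.
        return Decision(allowed=has_cap_sys_rawio, taints_kernel=has_cap_sys_rawio)
    # Scopes 1-3 need neither a taint nor a capability; scope 4 only taints.
    return Decision(allowed=True, taints_kernel=(scope == Scope.DEBUG_WRITE))
```

For example, `check_rpc(Scope.DEBUG_FULL, has_cap_sys_rawio=False)` refuses the request, while `check_rpc(Scope.DEBUG_WRITE, False)` allows it and reports a taint.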
Document the purpose and rules for the fwctl subsystem.

Link in kdocs to the doc tree.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
 Documentation/userspace-api/index.rst |   1 +
 2 files changed, 270 insertions(+)
 create mode 100644 Documentation/userspace-api/fwctl.rst
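The "sysfs Class" section of the patch describes a numbered class directory (/sys/class/fwctl/fwctlNN/) with a matching character device (/dev/fwctl/fwctlNN). A minimal sketch of how a user space tool might enumerate those instances follows, assuming exactly that layout from this v1 posting; `list_fwctl_devices()` and its parameters are hypothetical names, and the roots are parameters so the function can be exercised against a scratch directory.

```python
from pathlib import Path


def list_fwctl_devices(sysfs_root="/sys", dev_root="/dev"):
    """Return (name, char device path, parent device path) per fwctl instance.

    Assumes the layout described in the patch: one directory per instance
    under <sysfs_root>/class/fwctl, each with a 'device' link to its parent.
    """
    class_dir = Path(sysfs_root) / "class" / "fwctl"
    if not class_dir.is_dir():
        # No fwctl class present, for example the subsystem is not loaded.
        return []
    found = []
    for entry in sorted(class_dir.iterdir()):
        # Follow the 'device' link to the underlying (e.g. PCI) device.
        parent = (entry / "device").resolve()
        char_dev = Path(dev_root) / "fwctl" / entry.name
        found.append((entry.name, str(char_dev), str(parent)))
    return found
```

On a system without any fwctl device the function returns an empty list rather than raising, which keeps a discovery tool usable on kernels that lack the subsystem.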