Message ID | 20211009003711.1390019-13-sathyanarayanan.kuppuswamy@linux.intel.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Add TDX Guest Support (shared-mm support) | expand |
On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: > From: Andi Kleen <ak@linux.intel.com> > > For Confidential VM guests like TDX, the host is untrusted and hence > the devices emulated by the host or any data coming from the host > cannot be trusted. So the drivers that interact with the outside world > have to be hardened by sharing memory with host on need basis > with proper hardening fixes. > > For the PCI driver case, to share the memory with the host add > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. > > Signed-off-by: Andi Kleen <ak@linux.intel.com> > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> So I proposed to make all pci mappings shared, eliminating the need to patch drivers. To which Andi replied One problem with removing the ioremap opt-in is that it's still possible for drivers to get at devices without going through probe. To which Greg replied: https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ If there are in-kernel PCI drivers that do not do this, they need to be fixed today. Can you guys resolve the differences here? And once they are resolved, mention this in the commit log so I don't get to re-read the series just to find out nothing changed in this respect? I frankly do not believe we are anywhere near being able to harden an arbitrary kernel config against attack. How about creating a defconfig that makes sense for TDX then? Anyone deviating from that better know what they are doing, this API tweaking is just putting policy into the kernel ... > --- > Changes since v4: > * Replaced "_shared" with "_host_shared" in pci_iomap* APIs > * Fixed commit log as per review comments. > > include/asm-generic/pci_iomap.h | 6 +++++ > lib/pci_iomap.c | 47 +++++++++++++++++++++++++++++++++ > 2 files changed, 53 insertions(+) > > diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h > index df636c6d8e6c..a4a83c8ab3cf 100644 > --- a/include/asm-generic/pci_iomap.h > +++ b/include/asm-generic/pci_iomap.h > @@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, > extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar, > unsigned long offset, > unsigned long maxlen); > +extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar, > + unsigned long max); > +extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar, > + unsigned long offset, > + unsigned long maxlen); > + > /* Create a virtual mapping cookie for a port on a given PCI device. > * Do not call this directly, it exists to make it easier for architectures > * to override */ > diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c > index 57bd92f599ee..2816dc8715da 100644 > --- a/lib/pci_iomap.c > +++ b/lib/pci_iomap.c > @@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size) > return ioremap_wc(addr, size); > } > > +static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size) > +{ > + return ioremap_host_shared(addr, size); > +} > + > static void __iomem *pci_iomap_range_map(struct pci_dev *dev, > int bar, > unsigned long offset, > @@ -106,6 +111,48 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev, > } > EXPORT_SYMBOL_GPL(pci_iomap_wc_range); > > +/** > + * pci_iomap_host_shared_range - create a virtual shared mapping cookie > + * for a PCI BAR > + * @dev: PCI device that owns the BAR > + * @bar: BAR number > + * @offset: map memory at the given offset in BAR > + * @maxlen: max length of the memory to map > + * > + * Remap a pci device's resources shared in a confidential guest. > + * For more details see pci_iomap_range's documentation. So how does a driver author know when to use this function, and when to use the regular pci_iomap_range? Drivers have no idea whether they are used in a confidential guest, and which ranges are shared, it's a TDX thing ... This documentation should really address it. > + * > + * @maxlen specifies the maximum length to map. To get access to > + * the complete BAR from offset to the end, pass %0 here. > + */ > +void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar, > + unsigned long offset, > + unsigned long maxlen) > +{ > + return pci_iomap_range_map(dev, bar, offset, maxlen, > + map_ioremap_host_shared, true); > +} > +EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range); > + > +/** > + * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI BAR > + * @dev: PCI device that owns the BAR > + * @bar: BAR number > + * @maxlen: length of the memory to map > + * > + * See pci_iomap for details. This function creates a shared mapping > + * with the host for confidential hosts. > + * > + * @maxlen specifies the maximum length to map. To get access to the > + * complete BAR without checking for its length first, pass %0 here. > + */ > +void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar, > + unsigned long maxlen) > +{ > + return pci_iomap_host_shared_range(dev, bar, 0, maxlen); > +} > +EXPORT_SYMBOL_GPL(pci_iomap_host_shared); > + > /** > * pci_iomap - create a virtual mapping cookie for a PCI BAR > * @dev: PCI device that owns the BAR > -- > 2.25.1
On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: > > From: Andi Kleen <ak@linux.intel.com> > > > > For Confidential VM guests like TDX, the host is untrusted and hence > > the devices emulated by the host or any data coming from the host > > cannot be trusted. So the drivers that interact with the outside world > > have to be hardened by sharing memory with host on need basis > > with proper hardening fixes. > > > > For the PCI driver case, to share the memory with the host add > > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com> > > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> > > So I proposed to make all pci mappings shared, eliminating the need > to patch drivers. > > To which Andi replied > One problem with removing the ioremap opt-in is that > it's still possible for drivers to get at devices without going through probe. > > To which Greg replied: > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > If there are in-kernel PCI drivers that do not do this, they need to be > fixed today. > > Can you guys resolve the differences here? I agree with you and Greg here. If a driver is accessing hardware resources outside of the bind lifetime of one of the devices it supports, and in a way that neither modrobe-policy nor device-authorization -policy infrastructure can block, that sounds like a bug report. Fix those drivers instead of sprinkling ioremap_shared in select places and with unclear rules about when a driver is allowed to do "shared" mappings. Let the new device-authorization mechanism (with policy in userspace) be the central place where all of these driver "trust" issues are managed. > And once they are resolved, mention this in the commit log so > I don't get to re-read the series just to find out nothing > changed in this respect? > > I frankly do not believe we are anywhere near being able to harden > an arbitrary kernel config against attack. > How about creating a defconfig that makes sense for TDX then? > Anyone deviating from that better know what they are doing, > this API tweaking is just putting policy into the kernel ... Right, userspace authorization policy and select driver fixups seems to be the answer to the raised concerns.
On 10/9/2021 1:39 PM, Dan Williams wrote: > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote: >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: >>> From: Andi Kleen <ak@linux.intel.com> >>> >>> For Confidential VM guests like TDX, the host is untrusted and hence >>> the devices emulated by the host or any data coming from the host >>> cannot be trusted. So the drivers that interact with the outside world >>> have to be hardened by sharing memory with host on need basis >>> with proper hardening fixes. >>> >>> For the PCI driver case, to share the memory with the host add >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. >>> >>> Signed-off-by: Andi Kleen <ak@linux.intel.com> >>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> >> So I proposed to make all pci mappings shared, eliminating the need >> to patch drivers. >> >> To which Andi replied >> One problem with removing the ioremap opt-in is that >> it's still possible for drivers to get at devices without going through probe. >> >> To which Greg replied: >> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ >> If there are in-kernel PCI drivers that do not do this, they need to be >> fixed today. >> >> Can you guys resolve the differences here? > I agree with you and Greg here. If a driver is accessing hardware > resources outside of the bind lifetime of one of the devices it > supports, and in a way that neither modrobe-policy nor > device-authorization -policy infrastructure can block, that sounds > like a bug report. The 5.15 tree has something like ~2.4k IO accesses (including MMIO and others) in init functions that also register drivers (thanks Elena for the number) Some are probably old drivers that could be fixed, but it's quite a few legitimate cases. For example for platform or ISA drivers that's the only way they can be implemented because they often have no other enumeration mechanism. For PCI drivers it's rarer, but also still can happen. One example that comes to mind here is the x86 Intel uncore drivers, which support a mix of MSR, ioremap and PCI config space accesses all from the same driver. This particular example can (and should be) fixed in other ways, but similar things also happen in other drivers, and they're not all broken. Even for the broken ones they're usually for some crufty old devices that has very few users, so it's likely untestable in practice. My point is just that the ecosystem of devices that Linux supports is messy enough that there are legitimate exceptions from the "First IO only in probe call only" rule. And we can't just fix them all. Even if we could it would be hard to maintain. Using a "firewall model" hooking into a few strategic points like we're proposing here is much saner for everyone. Now we can argue about the details. Right now what we're proposing has some redundancies: it has both a device model filter and low level filter for ioremap (this patch and some others). The low level filter is for catching issues that don't clearly fit into the "enumeration<->probe" model. You could call that redundant, but I would call it defense in depth or better safe than sorry. In theory it would be enough to have the low level opt-in only, but that would have the drawback that is something gets enumerated after all you would have all kind of weird device driver failures and in some cases even killed guests. So I think it makes sense to have > Fix those drivers instead of sprinkling > ioremap_shared in select places and with unclear rules about when a > driver is allowed to do "shared" mappings. Only add it when the driver has been audited and hardened. But I agree we need on a documented process for this. I will work on some documentation for a proposal. But essentially I think it should be some variant of what Elena has outlined in her talk at Security Summit. https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf That is using extra auditing/scrutiny at review time, supported with some static code analysis that points to the interaction points, and code needs to be fuzzed explicitly. However short term it's only three virtio drivers, so this is not a urgent problem. > Let the new > device-authorization mechanism (with policy in userspace) Default policy in user space just seems to be a bad idea here. Who should know if a driver is hardened other than the kernel? Maintaining the list somewhere else just doesn't make sense to me. Also there is the more practical problem that some devices are needed for booting. For example in TDX we can't print something to the console with this mechanism, so you would never get any output before the initrd. Just seems like a nightmare for debugging anything. There really needs to be an authorization mechanism that works reasonably early. I can see a point of having user space overrides though, but we need to have a sane kernel default that works early. -Andi
> To which Andi replied > One problem with removing the ioremap opt-in is that > it's still possible for drivers to get at devices without going through probe. > > To which Greg replied: > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > If there are in-kernel PCI drivers that do not do this, they need to be > fixed today. > > Can you guys resolve the differences here? I addressed this in my other mail, but we may need more discussion. > > And once they are resolved, mention this in the commit log so > I don't get to re-read the series just to find out nothing > changed in this respect? > > I frankly do not believe we are anywhere near being able to harden > an arbitrary kernel config against attack. Why not? Device filter and the opt-ins together are a fairly strong mechanism. And it's not that they're a lot of code or super complicated either. You're essentially objecting to a single line change in your subsystem here. > How about creating a defconfig that makes sense for TDX then? TDX can be used in many different ways, I don't think a defconfig is practical. In theory you could do some Kconfig dependency (at the pain point of having separate kernel binariees), but why not just do it at run time then if you maintain the list anyways. That's much easier and saner for everyone. In the past we usually always ended up with runtime mechanism for similar things anyways. Also it turns out that the filter mechanisms are needed for some arch drivers which are not even configurable, so alone it's probably not enough, > Anyone deviating from that better know what they are doing, > this API tweaking is just putting policy into the kernel ... Hardening drivers is kernel policy. It cannot be done anywhere else. -Andi
Just as last time: This does not make any sense. ioremap is shared by definition.
On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote: > > > To which Andi replied > > One problem with removing the ioremap opt-in is that > > it's still possible for drivers to get at devices without going through probe. > > > > To which Greg replied: > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > > If there are in-kernel PCI drivers that do not do this, they need to be > > fixed today. > > > > Can you guys resolve the differences here? > > > I addressed this in my other mail, but we may need more discussion. Hopefully Greg will reply to that one. > > > > > And once they are resolved, mention this in the commit log so > > I don't get to re-read the series just to find out nothing > > changed in this respect? > > > > I frankly do not believe we are anywhere near being able to harden > > an arbitrary kernel config against attack. > > Why not? Device filter and the opt-ins together are a fairly strong > mechanism. Because it does not end with I/O operations, that's a trivial example. module unloading is famous for being racy: I just re-read that part of virtio drivers and sure enough we have bugs there, this is after they have presumably been audited, so a TDX guest is better off just disabling hot-unplug completely, and hotplug isn't far behind. Malicious filesystems can exploit many linux systems unless you take pains to limit what is mounted and how. Networking devices tend to get into the default namespaces and can do more or less whatever CAP_NET_ADMIN can. Etc. I am not saying this makes the effort worthless, I am saying userspace better know very well what it's doing, and kernel better be configured in a very specific way. > And it's not that they're a lot of code or super complicated either. > > You're essentially objecting to a single line change in your subsystem here. Well I commented on the API patch, not the virtio patch. If it's a way for a driver to say "I am hardened and audited" then I guess it should at least say so. It has nothing to do with host or sharing, that's an implementation detail, and it obscures the actual limitations of the approach, in that eventually in an ideal world all drivers would be secure and use this API. Yes, if that's the API that PCI gains then virtio will use it. > > How about creating a defconfig that makes sense for TDX then? > > TDX can be used in many different ways, I don't think a defconfig is > practical. > > In theory you could do some Kconfig dependency (at the pain point of having > separate kernel binariees), but why not just do it at run time then if you > maintain the list anyways. That's much easier and saner for everyone. In the > past we usually always ended up with runtime mechanism for similar things > anyways. > > Also it turns out that the filter mechanisms are needed for some arch > drivers which are not even configurable, so alone it's probably not enough, I guess they aren't really needed though right, or you won't try to filter them? So make them configurable? > > > Anyone deviating from that better know what they are doing, > > this API tweaking is just putting policy into the kernel ... > > Hardening drivers is kernel policy. It cannot be done anywhere else. > > > -Andi To clarify, the policy is which drivers to load into the kernel.
On 10/11/2021 12:58 AM, Christoph Hellwig wrote: > Just as last time: This does not make any sense. ioremap is shared > by definition. It's not necessarily shared with the host for confidential computing: for example BIOS mappings definitely should not be shared, but they're using ioremap today. But if you have a better term please propose something. I tried to clarify it with "shared_host", but I don't know a better term. -Andi
> Because it does not end with I/O operations, that's a trivial example. > module unloading is famous for being racy: I just re-read that part of > virtio drivers and sure enough we have bugs there, this is after > they have presumably been audited, so a TDX guest is better off > just disabling hot-unplug completely, and hotplug isn't far behind. These all shouldn't matter for a confidential guest. The only way it can be attacked is through IO, everything else is protected by hardware. Also it would all require doing something at the guest level, which we assume is not malicious. > Malicious filesystems can exploit many linux systems unless > you take pains to limit what is mounted and how. That's expected to be handled by authenticated dmcrypt and similar. Hardening at this level has been done for many years. > Networking devices tend to get into the default namespaces and can > do more or less whatever CAP_NET_ADMIN can. > Etc. Networking should be already hardened, otherwise you would have much worse problems today. > hange in your subsystem here. > Well I commented on the API patch, not the virtio patch. > If it's a way for a driver to say "I am hardened > and audited" then I guess it should at least say so. This is handled by the central allow list. We intentionally didn't want each driver to declare itself, but have a central list where changes will get more scrutiny than random driver code. But then there are the additional opt-ins for the low level firewall. These are in the API. I don't see how it could be done at the driver level, unless you want to pass in a struct device everywhere? >>> How about creating a defconfig that makes sense for TDX then? >> TDX can be used in many different ways, I don't think a defconfig is >> practical. >> >> In theory you could do some Kconfig dependency (at the pain point of having >> separate kernel binariees), but why not just do it at run time then if you >> maintain the list anyways. That's much easier and saner for everyone. In the >> past we usually always ended up with runtime mechanism for similar things >> anyways. >> >> Also it turns out that the filter mechanisms are needed for some arch >> drivers which are not even configurable, so alone it's probably not enough, > > I guess they aren't really needed though right, or you won't try to > filter them? We're addressing most of them with the device filter for platform drivers. But since we cannot stop them doing ioremap IO in their init code they also need the low level firewall. Some others that cannot be addressed have explicit disables. > So make them configurable? Why not just fix the runtime? It's much saner for everyone. Proposing to do things at build time sounds like we're in Linux 0.99 days. -Andi
On Mon, Oct 11, 2021 at 10:32:23AM -0700, Andi Kleen wrote: > > > Because it does not end with I/O operations, that's a trivial example. > > module unloading is famous for being racy: I just re-read that part of > > virtio drivers and sure enough we have bugs there, this is after > > they have presumably been audited, so a TDX guest is better off > > just disabling hot-unplug completely, and hotplug isn't far behind. > > These all shouldn't matter for a confidential guest. The only way it can be > attacked is through IO, everything else is protected by hardware. > > > Also it would all require doing something at the guest level, which we > assume is not malicious. > > > > Malicious filesystems can exploit many linux systems unless > > you take pains to limit what is mounted and how. > > That's expected to be handled by authenticated dmcrypt and similar. > Hardening at this level has been done for many years. It's possible to do it like this, sure. But that's not the only configuration, userspace needs to be smart about setting things up. Which is my point really. > > > Networking devices tend to get into the default namespaces and can > > do more or less whatever CAP_NET_ADMIN can. > > Etc. > > > Networking should be already hardened, otherwise you would have much worse > problems today. Same thing. NFS is pretty common, you are saying don't do it then. Fair enough but again, arbitrary configs just aren't going to be secure. > > > > hange in your subsystem here. > > Well I commented on the API patch, not the virtio patch. > > If it's a way for a driver to say "I am hardened > > and audited" then I guess it should at least say so. > > > This is handled by the central allow list. We intentionally didn't want each > driver to declare itself, but have a central list where changes will get > more scrutiny than random driver code. Makes sense. Additionally, distros can tweak that to their heart's content, selecting the functionality/security balance that makes sense for them. > But then there are the additional opt-ins for the low level firewall. These > are in the API. I don't see how it could be done at the driver level, unless > you want to pass in a struct device everywhere? I am just saying don't do it then. Don't build drivers that distro does not want to support into kernel. And don't load them when they are modules. > > > > How about creating a defconfig that makes sense for TDX then? > > > TDX can be used in many different ways, I don't think a defconfig is > > > practical. > > > > > > In theory you could do some Kconfig dependency (at the pain point of having > > > separate kernel binariees), but why not just do it at run time then if you > > > maintain the list anyways. That's much easier and saner for everyone. In the > > > past we usually always ended up with runtime mechanism for similar things > > > anyways. > > > > > > Also it turns out that the filter mechanisms are needed for some arch > > > drivers which are not even configurable, so alone it's probably not enough, > > > > I guess they aren't really needed though right, or you won't try to > > filter them? > > We're addressing most of them with the device filter for platform drivers. > But since we cannot stop them doing ioremap IO in their init code they also > need the low level firewall. > > Some others that cannot be addressed have explicit disables. > > > > So make them configurable? > > Why not just fix the runtime? It's much saner for everyone. Proposing to do > things at build time sounds like we're in Linux 0.99 days. > > -Andi Um. Tweaking driver code is not just build time, it's development time. At least with kconfig you don't need to patch your kernel.
On Mon, Oct 11, 2021 at 10:23:00AM -0700, Andi Kleen wrote: > > On 10/11/2021 12:58 AM, Christoph Hellwig wrote: > > Just as last time: This does not make any sense. ioremap is shared > > by definition. > > It's not necessarily shared with the host for confidential computing: for > example BIOS mappings definitely should not be shared, but they're using > ioremap today. That just needs to be fixed. > But if you have a better term please propose something. I tried to clarify > it with "shared_host", but I don't know a better term. > > > -Andi > The reason we have trouble is that it's not clear what does the API mean outside the realm of TDX. If we really, truly want an API that says "ioremap and it's a hardened driver" then I guess ioremap_hardened_driver is what you want.
On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote: > The reason we have trouble is that it's not clear what does the API mean > outside the realm of TDX. > If we really, truly want an API that says "ioremap and it's a hardened > driver" then I guess ioremap_hardened_driver is what you want. Yes. And why would be we ioremap the BIOS anyway? It is not I/O memory in any of the senses we generally use ioremap for.
On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen <ak@linux.intel.com> wrote: > > > On 10/9/2021 1:39 PM, Dan Williams wrote: > > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote: > >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: > >>> From: Andi Kleen <ak@linux.intel.com> > >>> > >>> For Confidential VM guests like TDX, the host is untrusted and hence > >>> the devices emulated by the host or any data coming from the host > >>> cannot be trusted. So the drivers that interact with the outside world > >>> have to be hardened by sharing memory with host on need basis > >>> with proper hardening fixes. > >>> > >>> For the PCI driver case, to share the memory with the host add > >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. > >>> > >>> Signed-off-by: Andi Kleen <ak@linux.intel.com> > >>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> > >> So I proposed to make all pci mappings shared, eliminating the need > >> to patch drivers. > >> > >> To which Andi replied > >> One problem with removing the ioremap opt-in is that > >> it's still possible for drivers to get at devices without going through probe. > >> > >> To which Greg replied: > >> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > >> If there are in-kernel PCI drivers that do not do this, they need to be > >> fixed today. > >> > >> Can you guys resolve the differences here? > > I agree with you and Greg here. If a driver is accessing hardware > > resources outside of the bind lifetime of one of the devices it > > supports, and in a way that neither modrobe-policy nor > > device-authorization -policy infrastructure can block, that sounds > > like a bug report. > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for > the number) > > Some are probably old drivers that could be fixed, but it's quite a few > legitimate cases. For example for platform or ISA drivers that's the > only way they can be implemented because they often have no other > enumeration mechanism. For PCI drivers it's rarer, but also still can > happen. One example that comes to mind here is the x86 Intel uncore > drivers, which support a mix of MSR, ioremap and PCI config space > accesses all from the same driver. This particular example can (and > should be) fixed in other ways, but similar things also happen in other > drivers, and they're not all broken. Even for the broken ones they're > usually for some crufty old devices that has very few users, so it's > likely untestable in practice. > > My point is just that the ecosystem of devices that Linux supports is > messy enough that there are legitimate exceptions from the "First IO > only in probe call only" rule. > > And we can't just fix them all. Even if we could it would be hard to > maintain. > > Using a "firewall model" hooking into a few strategic points like we're > proposing here is much saner for everyone. > > Now we can argue about the details. Right now what we're proposing has > some redundancies: it has both a device model filter and low level > filter for ioremap (this patch and some others). The low level filter is > for catching issues that don't clearly fit into the > "enumeration<->probe" model. You could call that redundant, but I would > call it defense in depth or better safe than sorry. In theory it would > be enough to have the low level opt-in only, but that would have the > drawback that is something gets enumerated after all you would have all > kind of weird device driver failures and in some cases even killed > guests. So I think it makes sense to have The "better safe-than-sorry" argument is hard to build consensus around. The spectre mitigations ran into similar problems where the community rightly wanted to see the details and instrument the problematic paths rather than blanket sprinkle lfence "just to be safe". In this case the rules about when a driver is suitably "hardened" are vague and the overlapping policy engines are confusing. I'd rather see more concerted efforts focused/limited core changes rather than leaf driver changes until there is a clearer definition of hardened. I.e. instead of jumping to the assertion that fixing up these init-path vulnerabilities are too big to fix, dig to the next level to provide more evidence that per-driver opt-in is the only viable option. For example, how many of these problematic paths are built-in to the average kernel config? A strawman might be to add a sprinkling error exits in the module_init() of the problematic drivers, and only fail if the module is built-in, and let modprobe policy handle the rest. > > > > Fix those drivers instead of sprinkling > > ioremap_shared in select places and with unclear rules about when a > > driver is allowed to do "shared" mappings. > > Only add it when the driver has been audited and hardened. > > But I agree we need on a documented process for this. I will work on > some documentation for a proposal. But essentially I think it should be > some variant of what Elena has outlined in her talk at Security Summit. > > https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf > > That is using extra auditing/scrutiny at review time, supported with > some static code analysis that points to the interaction points, and > code needs to be fuzzed explicitly. > > However short term it's only three virtio drivers, so this is not a > urgent problem. > > > Let the new > > device-authorization mechanism (with policy in userspace) > > > Default policy in user space just seems to be a bad idea here. Who > should know if a driver is hardened other than the kernel? Maintaining > the list somewhere else just doesn't make sense to me. I do not understand the maintenance burden correlation of where the policy is driven vs where the list is maintained? Even if I agreed with the contention that out-of-tree userspace would have a hard time tracking the "hardened" driver list there is still an in-tree userspace path to explore. E.g. perf maintains lists of things tightly coupled to the kernel, this authorized device list seems to be in the same category of data. > Also there is the more practical problem that some devices are needed > for booting. For example in TDX we can't print something to the console > with this mechanism, so you would never get any output before the > initrd. Just seems like a nightmare for debugging anything. There really > needs to be an authorization mechanism that works reasonably early. > > I can see a point of having user space overrides though, but we need to > have a sane kernel default that works early. Right, as I suggested [1], just enough early authorization to bootstrap/debug initramfs and then that can authorize the remainder. [1]: https://lore.kernel.org/all/CAPcyv4im4Tsj1SnxSWe=cAHBP1mQ=zgO-D81n2BpD+_HkpitbQ@mail.gmail.com/
> The "better safe-than-sorry" argument is hard to build consensus > around. The spectre mitigations ran into similar problems where the > community rightly wanted to see the details and instrument the > problematic paths rather than blanket sprinkle lfence "just to be > safe". But that was due to performance problems in hot paths. Nothing of this applies here. > In this case the rules about when a driver is suitably > "hardened" are vague and the overlapping policy engines are confusing. What is confusing exactly? For me it both seems very straight forward and simple (but then I'm biased) The policy is: - Have an allow list at driver registration. - Have an additional opt-in for MMIO mappings (and potentially config space, but that's not currently there) to cover init calls completely. > > I'd rather see more concerted efforts focused/limited core changes > rather than leaf driver changes until there is a clearer definition of > hardened. A hardened driver is a driver that - Had similar security (not API) oriented review of its IO operations (mainly MMIO access, but also PCI config space) as a non privileged user interface (like a ioctl). That review should be focused on memory safety. - Had some fuzzing on these IO interfaces using to be released tools. Right now it's only three virtio drivers (console, net, block) Really it's no different than what we do for every new unprivileged user interface. > I.e. instead of jumping to the assertion that fixing up > these init-path vulnerabilities are too big to fix, dig to the next > level to provide more evidence that per-driver opt-in is the only > viable option. > > For example, how many of these problematic paths are built-in to the > average kernel config? I don't think arguments from "the average kernel config" (if such a thing even exists) are useful. That's would be just hand waving. > A strawman might be to add a sprinkling error > exits in the module_init() of the problematic drivers, and only fail > if the module is built-in, and let modprobe policy handle the rest. That would be already hundreds of changes. I have no idea how such a thing could be maintained or sustained either. Really I don't even see how these alternatives can be considered. Tree sweeps should always be last resort. They're a pain for everyone. But here they're casually thrown around as alternatives to straight forward one or two line changes. > >> Default policy in user space just seems to be a bad idea here. Who >> should know if a driver is hardened other than the kernel? Maintaining >> the list somewhere else just doesn't make sense to me. > I do not understand the maintenance burden correlation of where the > policy is driven vs where the list is maintained? All the hardening and auditing happens in the kernel tree. So it seems the natural place to store the result is in the kernel tree. But there's no single package for initrd, so you would need custom configurations for all the supported distros. Also we're really arguing about a list that currently has three entries. > Even if I agreed > with the contention that out-of-tree userspace would have a hard time > tracking the "hardened" driver list there is still an in-tree > userspace path to explore. E.g. perf maintains lists of things tightly > coupled to the kernel, this authorized device list seems to be in the > same category of data. You mean the event list? perf is in the kernel tree, so it's maintained together with the kernel. But we don't have a kernel initrd. > >> Also there is the more practical problem that some devices are needed >> for booting. For example in TDX we can't print something to the console >> with this mechanism, so you would never get any output before the >> initrd. Just seems like a nightmare for debugging anything. There really >> needs to be an authorization mechanism that works reasonably early. >> >> I can see a point of having user space overrides though, but we need to >> have a sane kernel default that works early. > Right, as I suggested [1], just enough early authorization to > bootstrap/debug initramfs and then that can authorize the remainder. But how do you debug the kernel then? Making early undebuggable seems just bad policy to me. And if you fix if for the console why not add the two more entries for virtio net and block too? -Andi
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for > the number) To provide more numbers on this. What I can see so far from a smatch-based analysis, we have 409 __init style functions (.probe & builtin/module_ _platform_driver_probe excluded) for 5.15 with allyesconfig. The number of distinct individual IO reads (MSRs included) is much higher than 2.4k and on the range of 30k because quite often there are more than a single IO read in the same source function. The full list of accesses and the possible call paths is huge for manually looking at, but here is the list of the 409 functions if anyone wants to take a look: ['doc200x_ident_chip', 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init', 'late_iommu_features_init', 'detect_ivrs', 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping', 'intel_cleanup_irq_remapping', 'detect_intel_iommu', 'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init', 'fb_console_init', 'xenbus_probe_backend_init', 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event', 'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init', 'check_tylersburg_isoch', 'dmar_table_init', 'enable_drhd_fault_handling', 'init_acpi_pm_clocksource', 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init', 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init', 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init', 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init', 'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init', 'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init', 'samsung_clk_register_mux', 'samsung_clk_register_gate', 'samsung_clk_register_fixed_rate', 'clk_boston_setup', 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc', 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock', 'meson6_timer_init', 'atcpit100_timer_init', 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init', 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init', 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init', 'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init', 'bxt_idle_state_table_update', 'sklh_idle_state_table_update', 'skx_idle_state_table_update', 'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe', 'com90io_probe', 'com90xx_probe', 'pcnet32_init_module', 'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find', 'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find', 'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect', 'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register', 'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register', 'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init', 'init_control', 'con_init', 'cr_pll_init', 'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init', 'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init', 'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init', 'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe', 'timer_probe', 'of_clk_init', '__reserved_mem_init_node', 'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip', 'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init', 'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma', 'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found', 'intel_soc_thermal_init', 'thermal_register_governors', 'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init', 'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup', 'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init', 'early_console_setup', 'init_port', 'early_serial8250_setup', 'linflex_console_setup', 'pl010_console_setup', 'register_earlycon', 'of_setup_earlycon', 'slgt_init', 'moxa_init', 'parport_pc_init_superio', 'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init', 'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init', 'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list', 'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry', 'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init', 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init', 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set', 'a72_b53_rac_enable_all', 'mcp_a72_b53_set', 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision', 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init', 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom', 'create_one_cmux', 'create_one_pll', 'p2041_init_periph', 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph', 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init', 'sh73a0_cpg_clocks_init', 'cpg_div6_register', 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk', 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register', 'cpg_sd_clk_register', 'r7s9210_update_clk_table', 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init', 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register', 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init', 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup', 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup', 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init', 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init', 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init', 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init', 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init', 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init', 'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp', 'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup', 'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp', 'nvidia_set_debug_port', 'detect_set_debug_port', 'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init', 'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe', 'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init', 'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit', 'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init', 'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init', 'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect', 'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init', 'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init', 'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init', 'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller', 'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus', 'devlink_class_init', 'switchtec_ntb_init', 'mport_init', 'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb', 'probe_chip_type', 'legacy_check_special_cases', 'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init', 'pcistub_init_devices_late', 'bcma_host_soc_register', 'bcma_bus_early_register', 'vga_arb_device_init', 'vga_arb_select_default_device', 'zf_init', 'watchdog_deferred_registration', 'wb_smsc_wdt_init', 'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init', 'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe', 'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init', 'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data', 'dmi_scan_machine', 'pci_assign_unassigned_resources', 'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc', 'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc', 'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate', 'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks', 'intel_uncore_init', 'qedr_init_module', 'isapnp_peek', 'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list', 'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract', 'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe', 'intel_opregion_present', 'extlog_init', 'intel_pstate_init', 'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc', 'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct', 'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init', 'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init', 'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi', 'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag', 'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init', 'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init', 'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect', 'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init', 'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe', 'intel_pstate_msrs_not_valid', 'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init', 'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init', 'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init', 'rio_basic_attach]
On 10/11/2021 10:31 PM, Christoph Hellwig wrote: > On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote: >> The reason we have trouble is that it's not clear what does the API mean >> outside the realm of TDX. >> If we really, truly want an API that says "ioremap and it's a hardened >> driver" then I guess ioremap_hardened_driver is what you want. > Yes. And why would be we ioremap the BIOS anyway? It is not I/O memory > in any of the senses we generally use ioremap for. I/O memory is anything outside the kernel memory map. -Andi
On 10/12/2021 11:36 AM, Reshetova, Elena wrote: >> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and >> others) in init functions that also register drivers (thanks Elena for >> the number) > To provide more numbers on this. What I can see so far from a smatch-based > analysis, we have 409 __init style functions (.probe & builtin/module_ > _platform_driver_probe excluded) for 5.15 with allyesconfig. The number of > distinct individual IO reads (MSRs included) is much higher than 2.4k and on the > range of 30k because quite often there are more than a single IO read in the > same source function. The full list of accesses and the possible call paths is huge > for manually looking at, but here is the list of the 409 functions if anyone wants > to take a look: Thanks Elena. I suspect the true number is even higher because that doesn't include IO inside calls to other modules and indirect pointers, correct? -Andi
> I suspect the true number is even higher because that doesn't include IO > inside calls to other modules and indirect pointers, correct? Actually everything should be included. Smatch has cross-function db and I am using it for getting the call chains and it follows function pointers. Also since I am starting from a list of individual read IOs, every single base read IO in drivers/* should be covered as far as I can see. But if it uses some weird IO wrappers then the actual list might be higher.
On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena <elena.reshetova@intel.com> wrote: > > > > I suspect the true number is even higher because that doesn't include IO > > inside calls to other modules and indirect pointers, correct? > > Actually everything should be included. Smatch has cross-function db and > I am using it for getting the call chains and it follows function pointers. > Also since I am starting from a list of individual read IOs, every single > base read IO in drivers/* should be covered as far as I can see. But if it uses > some weird IO wrappers then the actual list might be higher. Why analyze individual IO calls? I thought the goal here was to disable entire classes of ioremap() users?
On 10/12/2021 12:13 PM, Dan Williams wrote: > On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena > <elena.reshetova@intel.com> wrote: >> >>> I suspect the true number is even higher because that doesn't include IO >>> inside calls to other modules and indirect pointers, correct? >> Actually everything should be included. Smatch has cross-function db and >> I am using it for getting the call chains and it follows function pointers. >> Also since I am starting from a list of individual read IOs, every single >> base read IO in drivers/* should be covered as far as I can see. But if it uses >> some weird IO wrappers then the actual list might be higher. > Why analyze individual IO calls? I thought the goal here was to > disable entire classes of ioremap() users? This is everything that would need to be moved somewhere else if we didn't disable the entire classes of ioremap users. -Andi
On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > others) in init functions that also register drivers (thanks Elena for > > the number) > > To provide more numbers on this. What I can see so far from a smatch-based > analysis, we have 409 __init style functions (.probe & builtin/module_ > _platform_driver_probe excluded) for 5.15 with allyesconfig. I don't think we care about allyesconfig at all though. Just don't do that. How about allmodconfig? This is closer to what distros actually do.
On Tue, Oct 12, 2021 at 11:35 AM Andi Kleen <ak@linux.intel.com> wrote: > > > > The "better safe-than-sorry" argument is hard to build consensus > > around. The spectre mitigations ran into similar problems where the > > community rightly wanted to see the details and instrument the > > problematic paths rather than blanket sprinkle lfence "just to be > > safe". > > But that was due to performance problems in hot paths. Nothing of this > applies here. It applies because a new API that individual driver authors is being proposed and that's an ongoing maintenance burden that might be mitigated by hiding that implementation detail from leaf drivers. > > > In this case the rules about when a driver is suitably > > "hardened" are vague and the overlapping policy engines are confusing. > > What is confusing exactly? Multiple places to go to enable functionality. The device-filter firewall policy can collide with the ioremap access control policy. > For me it both seems very straight forward and simple (but then I'm biased) You seem to be having a difficult time iterating this proposal toward consensus. I don't think the base principles are being contested as much as the semantics, scope, and need for the proposed API that is in the purview of all leaf driver developers. > The policy is: > > - Have an allow list at driver registration. > > - Have an additional opt-in for MMIO mappings (and potentially config > space, but that's not currently there) to cover init calls completely. The proliferation of policy engines and driver special casing is the issue. Especially in this case where the virtio use case being opted-in is *already* in a path that has been authorized by the device-filter policy engine. I.e. why special case the ioremap() in virtio to be additionally authorized when the device has already been authorized to probe? Put another way, the easiest driver API change to merge would be no additional changes in leaf drivers. > > > > > I'd rather see more concerted efforts focused/limited core changes > > rather than leaf driver changes until there is a clearer definition of > > hardened. > > A hardened driver is a driver that > > - Had similar security (not API) oriented review of its IO operations > (mainly MMIO access, but also PCI config space) as a non privileged user > interface (like a ioctl). That review should be focused on memory safety. > > - Had some fuzzing on these IO interfaces using to be released tools. What is the intersection of ioremap() users that are outside of the proposed probe authorization regime AND want confidential computing support? > Right now it's only three virtio drivers (console, net, block) > > Really it's no different than what we do for every new unprivileged user > interface. > > > > I.e. instead of jumping to the assertion that fixing up > > these init-path vulnerabilities are too big to fix, dig to the next > > level to provide more evidence that per-driver opt-in is the only > > viable option. > > > > For example, how many of these problematic paths are built-in to the > > average kernel config? > > I don't think arguments from "the average kernel config" (if such a > thing even exists) are useful. That's would be just hand waving. I'm trying to bridge to your contention that this enabling can not rely on custom kernel configs and must offer protection on the same kernel image that might ship in the host, but lets set this aside and focus on when and where leaf drivers need to adopt a new API. > > A strawman might be to add a sprinkling error > > exits in the module_init() of the problematic drivers, and only fail > > if the module is built-in, and let modprobe policy handle the rest. > > > That would be already hundreds of changes. I have no idea how such a > thing could be maintained or sustained either. > > Really I don't even see how these alternatives can be considered. Tree > sweeps should always be last resort. They're a pain for everyone. But > here they're casually thrown around as alternatives to straight forward > one or two line changes. If it looked straightforward I'm not sure we would be having this discussion, I think it's reasonable to ask if this is a per-driver opt-in responsibility that must be added in addition to probe authorization. > >> Default policy in user space just seems to be a bad idea here. Who > >> should know if a driver is hardened other than the kernel? Maintaining > >> the list somewhere else just doesn't make sense to me. > > I do not understand the maintenance burden correlation of where the > > policy is driven vs where the list is maintained? > > All the hardening and auditing happens in the kernel tree. So it seems > the natural place to store the result is in the kernel tree. > > But there's no single package for initrd, so you would need custom > configurations for all the supported distros. > > Also we're really arguing about a list that currently has three entries. > > > > Even if I agreed > > with the contention that out-of-tree userspace would have a hard time > > tracking the "hardened" driver list there is still an in-tree > > userspace path to explore. E.g. perf maintains lists of things tightly > > coupled to the kernel, this authorized device list seems to be in the > > same category of data. > > You mean the event list? perf is in the kernel tree, so it's maintained > together with the kernel. > > But we don't have a kernel initrd. I'm proposing that this list is either tiny and slow moving enough for initrd builders to track manually, or it's a data file that ships in distro kernel packages that initrd builders can pull in. > >> Also there is the more practical problem that some devices are needed > >> for booting. For example in TDX we can't print something to the console > >> with this mechanism, so you would never get any output before the > >> initrd. Just seems like a nightmare for debugging anything. There really > >> needs to be an authorization mechanism that works reasonably early. > >> > >> I can see a point of having user space overrides though, but we need to > >> have a sane kernel default that works early. > > Right, as I suggested [1], just enough early authorization to > > bootstrap/debug initramfs and then that can authorize the remainder. > > But how do you debug the kernel then? Making early undebuggable seems > just bad policy to me. I am not proposing making the early undebuggable. > > And if you fix if for the console why not add the two more entries for > virtio net and block too? Again because there seems to be struggling consensus around what criteria constitutes being added to this list. In order to move this series forward I'm trying to find common ground.
On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote: > Especially in this case where the virtio use case being > opted-in is *already* in a path that has been authorized by the > device-filter policy engine. That's a good point. Andi, how about setting a per-device flag if its ID has been allowed and then making pci_iomap create a shared mapping transparently?
On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote: > On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote: >> Especially in this case where the virtio use case being >> opted-in is *already* in a path that has been authorized by the >> device-filter policy engine. > That's a good point. Andi, how about setting a per-device flag > if its ID has been allowed and then making pci_iomap create > a shared mapping transparently? Yes for pci_iomap we could do that. If someone uses raw ioremap without a device it won't work, but I don't think that's the case for virtio at least. I suppose we could solve that problem if it actually happens. -Andi
>> But that was due to performance problems in hot paths. Nothing of this >> applies here. > It applies because a new API that individual driver authors is being > proposed and that's an ongoing maintenance burden that might be > mitigated by hiding that implementation detail from leaf drivers. Right now we're only talking about 2 places to change, and none of those are actually in individual drivers, but in the virtio generic code and in the MSI code. While there might be drivers in the future that do it directly it will be always the exception, normal drivers don't have to deal with this. >> For me it both seems very straight forward and simple (but then I'm biased) > You seem to be having a difficult time iterating this proposal toward > consensus. I don't think the base principles are being contested as > much as the semantics, scope, and need for the proposed API that is in > the purview of all leaf driver developers. Right now there is no leaf driver changed at all. > >>> I'd rather see more concerted efforts focused/limited core changes >>> rather than leaf driver changes until there is a clearer definition of >>> hardened. >> A hardened driver is a driver that >> >> - Had similar security (not API) oriented review of its IO operations >> (mainly MMIO access, but also PCI config space) as a non privileged user >> interface (like a ioctl). That review should be focused on memory safety. >> >> - Had some fuzzing on these IO interfaces using to be released tools. > What is the intersection of ioremap() users that are outside of the > proposed probe authorization regime AND want confidential computing > support? Right now it's zero I believe. That is there is other low level code that sets memory shared, but it's not using ioremap, but some other mechanisms. > > are needed >>>> for booting. For example in TDX we can't print something to the console >>>> with this mechanism, so you would never get any output before the >>>> initrd. Just seems like a nightmare for debugging anything. There really >>>> needs to be an authorization mechanism that works reasonably early. >>>> >>>> I can see a point of having user space overrides though, but we need to >>>> have a sane kernel default that works early. >>> Right, as I suggested [1], just enough early authorization to >>> bootstrap/debug initramfs and then that can authorize the remainder. >> But how do you debug the kernel then? Making early undebuggable seems >> just bad policy to me. > I am not proposing making the early undebuggable. That's the implication of moving the policy into initrd. If only initrd can authorize then it won't be possible to authorize before initrd, thus the early console won't work. -Andi
On Tue, Oct 12, 2021 at 2:28 PM Andi Kleen <ak@linux.intel.com> wrote: [..] > >> But how do you debug the kernel then? Making early undebuggable seems > >> just bad policy to me. > > I am not proposing making the early undebuggable. > > > That's the implication of moving the policy into initrd. > > > If only initrd can authorize then it won't be possible to authorize > before initrd, thus the early console won't work. Again, the proposal is that the allow-list is limited to just enough devices to startup and debug the initramfs and no more. Everything else can be dynamic, and this allows for a powerful custom override interface without needing to debate additional ABI like command line overrides, and minimizes future changes to this kernel-internal allow-list.
> On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > others) in init functions that also register drivers (thanks Elena for > > > the number) > > > > To provide more numbers on this. What I can see so far from a smatch-based > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > I don't think we care about allyesconfig at all though. > Just don't do that. > How about allmodconfig? This is closer to what distros actually do. It does not make any difference really for the content of the /drivers/*: gives 408 __init style functions doing IO (.probe & builtin/module_ > > _platform_driver_probe excluded) for 5.15 with allmodconfig: ['doc200x_ident_chip', 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init', 'late_iommu_features_init', 'detect_ivrs', 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping', 'intel_cleanup_irq_remapping', 'detect_intel_iommu', 'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init', 'fb_console_init', 'xenbus_probe_backend_init', 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event', 'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init', 'check_tylersburg_isoch', 'dmar_table_init', 'enable_drhd_fault_handling', 'init_acpi_pm_clocksource', 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init', 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init', 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init', 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init', 'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init', 'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init', 'samsung_clk_register_mux', 'samsung_clk_register_gate', 'samsung_clk_register_fixed_rate', 'clk_boston_setup', 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc', 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock', 'meson6_timer_init', 'atcpit100_timer_init', 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init', 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init', 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init', 'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init', 'bxt_idle_state_table_update', 'sklh_idle_state_table_update', 'skx_idle_state_table_update', 'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe', 'com90io_probe', 'com90xx_probe', 'pcnet32_init_module', 'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find', 'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find', 'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect', 'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register', 'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register', 'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init', 'init_control', 'con_init', 'cr_pll_init', 'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init', 'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init', 'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init', 'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe', 'timer_probe', 'of_clk_init', '__reserved_mem_init_node', 'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip', 'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init', 'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma', 'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found', 'intel_soc_thermal_init', 'thermal_register_governors', 'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init', 'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup', 'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init', 'early_console_setup', 'init_port', 'early_serial8250_setup', 'linflex_console_setup', 'register_earlycon', 'of_setup_earlycon', 'slgt_init', 'moxa_init', 'parport_pc_init_superio', 'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init', 'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init', 'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list', 'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry', 'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init', 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init', 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set', 'a72_b53_rac_enable_all', 'mcp_a72_b53_set', 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision', 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init', 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom', 'create_one_cmux', 'create_one_pll', 'p2041_init_periph', 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph', 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init', 'sh73a0_cpg_clocks_init', 'cpg_div6_register', 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk', 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register', 'cpg_sd_clk_register', 'r7s9210_update_clk_table', 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init', 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register', 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init', 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup', 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup', 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init', 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init', 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init', 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init', 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init', 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init', 'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp', 'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup', 'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp', 'nvidia_set_debug_port', 'detect_set_debug_port', 'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init', 'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe', 'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init', 'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit', 'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init', 'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init', 'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect', 'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init', 'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init', 'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init', 'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller', 'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus', 'devlink_class_init', 'switchtec_ntb_init', 'mport_init', 'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb', 'probe_chip_type', 'legacy_check_special_cases', 'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init', 'pcistub_init_devices_late', 'bcma_host_soc_register', 'bcma_bus_early_register', 'vga_arb_device_init', 'vga_arb_select_default_device', 'zf_init', 'watchdog_deferred_registration', 'wb_smsc_wdt_init', 'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init', 'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe', 'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init', 'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data', 'dmi_scan_machine', 'pci_assign_unassigned_resources', 'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc', 'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc', 'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate', 'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks', 'intel_uncore_init', 'qedr_init_module', 'isapnp_peek', 'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list', 'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract', 'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe', 'intel_opregion_present', 'extlog_init', 'intel_pstate_init', 'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc', 'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct', 'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init', 'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init', 'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi', 'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag', 'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init', 'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init', 'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect', 'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init', 'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe', 'intel_pstate_msrs_not_valid', 'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init', 'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init', 'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init', 'rio_basic_attach']
On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote: > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > > others) in init functions that also register drivers (thanks Elena for > > > > the number) > > > > > > To provide more numbers on this. What I can see so far from a smatch-based > > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > > > I don't think we care about allyesconfig at all though. > > Just don't do that. > > How about allmodconfig? This is closer to what distros actually do. > > It does not make any difference really for the content of the /drivers/*: > gives 408 __init style functions doing IO (.probe & builtin/module_ > > > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > ['doc200x_ident_chip', > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', Um. ARM? Which architecture is this build for? > 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init', > 'late_iommu_features_init', 'detect_ivrs', > 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping', > 'intel_cleanup_irq_remapping', 'detect_intel_iommu', > 'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init', > 'fb_console_init', 'xenbus_probe_backend_init', > 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event', > 'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init', > 'check_tylersburg_isoch', 'dmar_table_init', > 'enable_drhd_fault_handling', 'init_acpi_pm_clocksource', > 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init', > 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init', > 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init', > 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init', > 'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init', > 'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init', > 'samsung_clk_register_mux', 'samsung_clk_register_gate', > 'samsung_clk_register_fixed_rate', 'clk_boston_setup', > 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc', > 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock', > 'meson6_timer_init', 'atcpit100_timer_init', > 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init', > 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init', > 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init', > 'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init', > 'bxt_idle_state_table_update', 'sklh_idle_state_table_update', > 'skx_idle_state_table_update', > 'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe', > 'com90io_probe', 'com90xx_probe', 'pcnet32_init_module', > 'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find', > 'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find', > 'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect', > 'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register', > 'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register', > 'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init', > 'init_control', 'con_init', 'cr_pll_init', > 'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init', > 'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init', > 'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init', > 'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe', > 'timer_probe', 'of_clk_init', '__reserved_mem_init_node', > 'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip', > 'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init', > 'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma', > 'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found', > 'intel_soc_thermal_init', 'thermal_register_governors', > 'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init', > 'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup', > 'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init', > 'early_console_setup', 'init_port', 'early_serial8250_setup', > 'linflex_console_setup', 'register_earlycon', 'of_setup_earlycon', > 'slgt_init', 'moxa_init', 'parport_pc_init_superio', > 'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init', > 'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init', > 'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list', > 'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry', > 'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init', > 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init', > 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set', > 'a72_b53_rac_enable_all', 'mcp_a72_b53_set', > 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision', > 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init', > 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom', > 'create_one_cmux', 'create_one_pll', 'p2041_init_periph', > 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph', > 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init', > 'sh73a0_cpg_clocks_init', 'cpg_div6_register', > 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk', > 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register', > 'cpg_sd_clk_register', 'r7s9210_update_clk_table', > 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init', > 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register', > 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init', > 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup', > 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup', > 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init', > 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init', > 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init', > 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init', > 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init', > 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init', > 'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp', > 'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup', > 'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp', > 'nvidia_set_debug_port', 'detect_set_debug_port', > 'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init', > 'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe', > 'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init', > 'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit', > 'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init', > 'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init', > 'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect', > 'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init', > 'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init', > 'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init', > 'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller', > 'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus', > 'devlink_class_init', 'switchtec_ntb_init', 'mport_init', > 'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb', > 'probe_chip_type', 'legacy_check_special_cases', > 'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init', > 'pcistub_init_devices_late', 'bcma_host_soc_register', > 'bcma_bus_early_register', 'vga_arb_device_init', > 'vga_arb_select_default_device', 'zf_init', > 'watchdog_deferred_registration', 'wb_smsc_wdt_init', > 'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init', > 'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe', > 'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init', > 'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data', > 'dmi_scan_machine', 'pci_assign_unassigned_resources', > 'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc', > 'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc', > 'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate', > 'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks', > 'intel_uncore_init', 'qedr_init_module', 'isapnp_peek', > 'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list', > 'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract', > 'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe', > 'intel_opregion_present', 'extlog_init', 'intel_pstate_init', > 'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc', > 'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct', > 'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init', > 'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init', > 'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi', > 'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag', > 'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init', > 'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init', > 'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect', > 'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init', > 'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe', > 'intel_pstate_msrs_not_valid', > 'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init', > 'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init', > 'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init', > 'rio_basic_attach']
> On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote: > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > > > others) in init functions that also register drivers (thanks Elena for > > > > > the number) > > > > > > > > To provide more numbers on this. What I can see so far from a smatch-based > > > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > > > > > I don't think we care about allyesconfig at all though. > > > Just don't do that. > > > How about allmodconfig? This is closer to what distros actually do. > > > > It does not make any difference really for the content of the /drivers/*: > > gives 408 __init style functions doing IO (.probe & builtin/module_ > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > > > ['doc200x_ident_chip', > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', > > Um. ARM? Which architecture is this build for? The list of smatch IO findings is built for x86, but the smatch cross function database covers all archs, so when queried for all potential function callers, it would show non x86 arch call chains also. Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests': Detected low-level IO from arm_v7s_do_selftests in fun __iommu_queue_command_sync drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error: {15002074744551330002} 'check_host_input' read from the host using function 'readl' to a member of the structure 'iommu->cmd_buf_head'; __iommu_queue_command_sync() iommu_completion_wait() amd_iommu_domain_flush_complete() iommu_v1_map_page() arm_v7s_do_selftests() So, the results can be further filtered if you want a specified arch.
On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote: > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote: > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > > > > others) in init functions that also register drivers (thanks Elena for > > > > > > the number) > > > > > > > > > > To provide more numbers on this. What I can see so far from a smatch-based > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > > > > > > > I don't think we care about allyesconfig at all though. > > > > Just don't do that. > > > > How about allmodconfig? This is closer to what distros actually do. > > > > > > It does not make any difference really for the content of the /drivers/*: > > > gives 408 __init style functions doing IO (.probe & builtin/module_ > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > > > > > ['doc200x_ident_chip', > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', > > > > Um. ARM? Which architecture is this build for? > > The list of smatch IO findings is built for x86, but the smatch cross function > database covers all archs, so when queried for all potential function callers, > it would show non x86 arch call chains also. > > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests': > > Detected low-level IO from arm_v7s_do_selftests in fun > __iommu_queue_command_sync > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error: > {15002074744551330002} > 'check_host_input' read from the host using function 'readl' to a > member of the structure 'iommu->cmd_buf_head'; > > __iommu_queue_command_sync() > iommu_completion_wait() > amd_iommu_domain_flush_complete() > iommu_v1_map_page() > arm_v7s_do_selftests() > > So, the results can be further filtered if you want a specified arch. So what is it just for x86? Could you tell?
On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote: > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote: > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > > > > others) in init functions that also register drivers (thanks Elena for > > > > > > the number) > > > > > > > > > > To provide more numbers on this. What I can see so far from a smatch-based > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > > > > > > > I don't think we care about allyesconfig at all though. > > > > Just don't do that. > > > > How about allmodconfig? This is closer to what distros actually do. > > > > > > It does not make any difference really for the content of the /drivers/*: > > > gives 408 __init style functions doing IO (.probe & builtin/module_ > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > > > > > ['doc200x_ident_chip', > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', > > > > Um. ARM? Which architecture is this build for? > > The list of smatch IO findings is built for x86, but the smatch cross function > database covers all archs, so when queried for all potential function callers, > it would show non x86 arch call chains also. > > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests': > > Detected low-level IO from arm_v7s_do_selftests in fun > __iommu_queue_command_sync > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error: > {15002074744551330002} > 'check_host_input' read from the host using function 'readl' to a > member of the structure 'iommu->cmd_buf_head'; > > __iommu_queue_command_sync() > iommu_completion_wait() > amd_iommu_domain_flush_complete() > iommu_v1_map_page() > arm_v7s_do_selftests() > > So, the results can be further filtered if you want a specified arch. Even better would be a typical distro build.
> On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote: > > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote: > > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > > > > > others) in init functions that also register drivers (thanks Elena for > > > > > > > the number) > > > > > > > > > > > > To provide more numbers on this. What I can see so far from a smatch- > based > > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > > > > > > > > > I don't think we care about allyesconfig at all though. > > > > > Just don't do that. > > > > > How about allmodconfig? This is closer to what distros actually do. > > > > > > > > It does not make any difference really for the content of the /drivers/*: > > > > gives 408 __init style functions doing IO (.probe & builtin/module_ > > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > > > > > > > ['doc200x_ident_chip', > > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', > > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', > > > > > > Um. ARM? Which architecture is this build for? > > > > The list of smatch IO findings is built for x86, but the smatch cross function > > database covers all archs, so when queried for all potential function callers, > > it would show non x86 arch call chains also. > > > > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests': > > > > Detected low-level IO from arm_v7s_do_selftests in fun > > __iommu_queue_command_sync > > > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error: > > {15002074744551330002} > > 'check_host_input' read from the host using function 'readl' to a > > member of the structure 'iommu->cmd_buf_head'; > > > > __iommu_queue_command_sync() > > iommu_completion_wait() > > amd_iommu_domain_flush_complete() > > iommu_v1_map_page() > > arm_v7s_do_selftests() > > > > So, the results can be further filtered if you want a specified arch. > > So what is it just for x86? Could you tell? I can probably figure out how to do additional filtering here, but does it really matter for the case that is being discussed here? Andi's point was that there quite many existing places in the kernel when low-level IO happens before the probe stage. So I brought these numbers here. What do you plan to do with the pure x86 results? And here are the full results for allyesconfig, if anyone is interested (just got permission to create the repository today): https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1 We will be pushing to this repo all the scripts and fuzzing setups we use as part of our Linux guest hardening effort for confidential cloud computing, but it is going to take some time to get all the approvals collected. Best Regards, Elena.
Elena, On Thu, Oct 14 2021 at 06:32, Elena Reshetova wrote: >> On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > It does not make any difference really for the content of the /drivers/*: > gives 408 __init style functions doing IO (.probe & builtin/module_ >> > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > ['doc200x_ident_chip', > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > 'ubi_gluebi_init', 'ubiblock_init' > 'ubi_init', 'mtd_stresstest_init', All of this is MTD and can just be disabled wholesale. Aside of that, most of these depend on either platform devices or device tree enumerations which are not ever available on X86. > 'probe_acpi_namespace_devices', > 'amd_iommu_init_pci', 'state_next', > 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init', > 'late_iommu_features_init', 'detect_ivrs', > 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping', > 'intel_cleanup_irq_remapping', 'detect_intel_iommu', > 'parse_ioapics_under_ir', 'si_domain_init', > 'intel_iommu_init', 'dmar_table_init', > 'enable_drhd_fault_handling', > 'check_tylersburg_isoch', None of this is reachable because the initial detection which is ACPI table based will fail for TDX. If not, it's a guest firmware problem. > 'fb_console_init', 'xenbus_probe_backend_init', > 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event', > 'balloon_init', XEN, that's relevant because magically the TDX guest will assume that it is a XEN instance? > 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init', > 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init', > 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init', > 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init', > 'sun5i_setup_clockevent', > 'mt7621_clk_init', > 'samsung_clk_register_mux', 'samsung_clk_register_gate', > 'samsung_clk_register_fixed_rate', 'clk_boston_setup', > 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc', > 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock', > 'meson6_timer_init', 'atcpit100_timer_init', > 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', > 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init', > 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set', > 'a72_b53_rac_enable_all', 'mcp_a72_b53_set', > 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision', > 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init', > 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom', > 'create_one_cmux', 'create_one_pll', 'p2041_init_periph', > 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph', > 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init', > 'sh73a0_cpg_clocks_init', 'cpg_div6_register', > 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk', > 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register', > 'cpg_sd_clk_register', 'r7s9210_update_clk_table', > 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init', > 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register', > 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init', > 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup', > 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup', > 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init', > 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init', > 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init', > 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init', > 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init', > 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init', > 'of_fixed_mmio_clk_setup', > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', ARM based drivers are initialized on x86 in which way? > 'hv_init_tsc_clocksource', 'hv_init_clocksource', HyperV. See XEN > 'skx_init', > 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init', > 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init', > 'pnd2_init', 'edac_init', 'adummy_init', EDAC has already hypervisor checks > 'init_acpi_pm_clocksource', Requires ACPI table entry or command line override > 'intel_rng_mod_init', Has an old style PCI table which is searched via pci_get_device(). Could do with a cleanup which converts it to proper PCI probing. <SNIP> So I stop here, because it would be way simpler to have the file names but so far I could identify all of it from the top of my head. So what are you trying to tell me? That you found tons of ioremaps in __init functions which are completely irrelevant. Please stop making arguments based on completely nonsensical data. It took me less than 5 minutes to eliminate more than 50% of that list and I'm pretty sure that I could have eliminated the bulk of the rest as well. The fact that a large part of this is ARM only, the fact that nobody bothered to look at how e.g. IOMMU detection works and whether those ioremaps actually can't be reached is hillarious. So of these 400 instances are at least 30% ARM specific and those cannot be reached on ARM nilly willy either because they are either device tree or ACPI enumerated. Claiming that it is soo much work to analyze 400 at least to the point: - whether they are relevant for x86 and therefore potentially TDX at all - whether they have some form of enumeration or detection which makes the ioremaps unreachable when the trusted BIOS is implemented correctly Ijust can laugh at that, really: Two of my engineers have done an inventory of hundreds of cpu hotplug notifier instances in a couple of days some years ago. Ditto for a couple of hundred seqcount and a couple of hundred tasklet usage sites. Sure, but it makes more security handwaving and a nice presentation to tell people how much unsecure code there is based on half thought out static analysis. To do a proper static analysis of this, you really have to do a proper brain based analysis first of: 1) Which code is relevant for x86 2) What are the mechanisms which are used across the X86 relevant driver space to make these ioremap/MSR accesses actually reachable. And of course this will not be complete, but this eliminates the vast majority of your list. And looking at the remaining ones is not rocket science either. I can't take that serious at all. Come back when you have a properly compiled list of drivers which: 1) Can even be built for X86 2) Do ioremap/MSR based poking unconditionally. 3) Cannot be easily guarded off at the subsystem level It's not going to be a huge list. Then we can talk about facts and talk about the work required to fix them or blacklist them in some way. Thanks, tglx
On Thu, Oct 14, 2021 at 12:33:49PM +0000, Reshetova, Elena wrote: > > On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote: > > > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote: > > > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote: > > > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > > > > > > > > others) in init functions that also register drivers (thanks Elena for > > > > > > > > the number) > > > > > > > > > > > > > > To provide more numbers on this. What I can see so far from a smatch- > > based > > > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_ > > > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig. > > > > > > > > > > > > I don't think we care about allyesconfig at all though. > > > > > > Just don't do that. > > > > > > How about allmodconfig? This is closer to what distros actually do. > > > > > > > > > > It does not make any difference really for the content of the /drivers/*: > > > > > gives 408 __init style functions doing IO (.probe & builtin/module_ > > > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig: > > > > > > > > > > ['doc200x_ident_chip', > > > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init', > > > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init', > > > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551', > > > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx', > > > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom', > > > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next', > > > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', > > > > > > > > Um. ARM? Which architecture is this build for? > > > > > > The list of smatch IO findings is built for x86, but the smatch cross function > > > database covers all archs, so when queried for all potential function callers, > > > it would show non x86 arch call chains also. > > > > > > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests': > > > > > > Detected low-level IO from arm_v7s_do_selftests in fun > > > __iommu_queue_command_sync > > > > > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error: > > > {15002074744551330002} > > > 'check_host_input' read from the host using function 'readl' to a > > > member of the structure 'iommu->cmd_buf_head'; > > > > > > __iommu_queue_command_sync() > > > iommu_completion_wait() > > > amd_iommu_domain_flush_complete() > > > iommu_v1_map_page() > > > arm_v7s_do_selftests() > > > > > > So, the results can be further filtered if you want a specified arch. > > > > So what is it just for x86? Could you tell? > > I can probably figure out how to do additional filtering here, but does > it really matter for the case that is being discussed here? Andi's point was > that there quite many existing places in the kernel when low-level IO > happens before the probe stage. So I brought these numbers here. > What do you plan to do with the pure x86 results? If the list is short - just suggest securing that ;) > And here are the full results for allyesconfig, if anyone is interested (just got permission to create > the repository today): > https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1 > We will be pushing to this repo all the scripts and fuzzing setups we use as part of > our Linux guest hardening effort for confidential cloud computing, but it is going to take > some time to get all the approvals collected. > > Best Regards, > Elena.
Andi, On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote: > On 10/9/2021 1:39 PM, Dan Williams wrote: >> I agree with you and Greg here. If a driver is accessing hardware >> resources outside of the bind lifetime of one of the devices it >> supports, and in a way that neither modrobe-policy nor >> device-authorization -policy infrastructure can block, that sounds >> like a bug report. > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for > the number) These numbers are completely useless simply because they are based on nonsensical criteria. See: https://lore.kernel.org/r/87r1cj2uad.ffs@tglx > My point is just that the ecosystem of devices that Linux supports is > messy enough that there are legitimate exceptions from the "First IO > only in probe call only" rule. Your point is based on your outright refusal to actualy do a proper analysis and your outright refusal to help fixing the real problems. All you have provided so far is handwaving based on a completely useless analysis. Sure, your goal is to get this TDX problem solved, but it's not going to be solved by: 1) Providing a nonsensical analysis 2) Using #1 as an argument to hack some half baken interfaces into the kernel which allow you to tick off your checkbox and then leave the resulting mess for others to clean up. Try again when you have factual data to back up your claims and factual arguments which prove that the problem can't be fixed otherwise. I might be repeating myself, but kernel development works this way: 1) Hack your private POC - Yay! 2) Sit down and think hard about the problems you identified in step #1. Do a thorough analysis. 3) Come up with a sensible integration plan. 4) Do the necessary grump work of cleanups all over the place 5) Add sensible infrastructure which is understandable for the bulk of kernel/driver developers 6) Let your feature fall in place and not in the way you are insisting on: 1) Hack your private POC - Yay! 2) Define that this is the only way to do it and try to shove it down the throat of everyone. 3) Getting told that this is not the way it works 4) Insist on it forever and blame the grumpy maintainers who are just not understanding the great value of your approach. 5) Go back to #2 You should know that already, but I have no problem to give that lecture to you over and over again. I probably should create a form letter. And no, you can bitch about me as much as you want. These are not my personal rules and personal pet pieves. These are rules Linus cares about very much and aside of that they just reflect common sense. The kernel is a common good and not the dump ground for your personal brain waste. The kernel does not serve Intel. Quite the contrary Intel depends on the kernel to work nicely with it's hardware. Ergo, Intel should have a vested interest to serve the kernel and take responsibility for it as a whole. And so should you as an Intel employee. Just dumping your next half baken workaround does not cut it especially not when it is not backed up by sensible arguments. Please try again, but not before you have something substantial to back up your claims. Thanks, Thomas
On Mon, Oct 18 2021 at 02:55, Thomas Gleixner wrote: > On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote: >> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and >> others) in init functions that also register drivers (thanks Elena for >> the number) > > These numbers are completely useless simply because they are based on > nonsensical criteria. See: > > https://lore.kernel.org/r/87r1cj2uad.ffs@tglx > >> My point is just that the ecosystem of devices that Linux supports is >> messy enough that there are legitimate exceptions from the "First IO >> only in probe call only" rule. > > Your point is based on your outright refusal to actualy do a proper > analysis and your outright refusal to help fixing the real problems. > > All you have provided so far is handwaving based on a completely useless > analysis. > > Sure, your goal is to get this TDX problem solved, but it's not going to > be solved by: > > 1) Providing a nonsensical analysis > > 2) Using #1 as an argument to hack some half baken interfaces into the > kernel which allow you to tick off your checkbox and then leave the > resulting mess for others to clean up. > > Try again when you have factual data to back up your claims and factual > arguments which prove that the problem can't be fixed otherwise. > > I might be repeating myself, but kernel development works this way: > > 1) Hack your private POC - Yay! > > 2) Sit down and think hard about the problems you identified in step > #1. Do a thorough analysis. > > 3) Come up with a sensible integration plan. > > 4) Do the necessary grump work of cleanups all over the place > > 5) Add sensible infrastructure which is understandable for the bulk > of kernel/driver developers > > 6) Let your feature fall in place > > and not in the way you are insisting on: > > 1) Hack your private POC - Yay! > > 2) Define that this is the only way to do it and try to shove it down > the throat of everyone. > > 3) Getting told that this is not the way it works > > 4) Insist on it forever and blame the grumpy maintainers who are just > not understanding the great value of your approach. > > 5) Go back to #2 > > You should know that already, but I have no problem to give that lecture > to you over and over again. I probably should create a form letter. > > And no, you can bitch about me as much as you want. These are not my > personal rules and personal pet pieves. These are rules Linus cares > about very much and aside of that they just reflect common sense. > > The kernel is a common good and not the dump ground for your personal > brain waste. > > The kernel does not serve Intel. Quite the contrary Intel depends on > the kernel to work nicely with it's hardware. Ergo, Intel should have > a vested interest to serve the kernel and take responsibility for it > as a whole. And so should you as an Intel employee. > > Just dumping your next half baken workaround does not cut it especially > not when it is not backed up by sensible arguments. > > Please try again, but not before you have something substantial to back > up your claims. That said, I can't resist the urge to say a few words to the responsible senior and management people at Intel in this context: I surely know that a lot of Intel people claim that their lack of progress is _only_ because Thomas is hard to work with and Thomas wants unreasonable changes to their code, which I could perceive as an abuse of myself for the purpose of self-deception. TBH, I don't give a damn. Let me ask a few questions instead: - Is it unreasonable to expect that argumentations are based on facts and proper analysis? - Is it unreasonable to expect a proper integration of a new feature? - Does it take unreasonable effort to do a proper design? - Is it unreasonable to ask that he necessary cleanups are done upfront? If anyone of the responsible people at Intel thinks so, then they should speak up now and tell me in public and into my face what's so unreasonable about that. Thanks, Thomas
Thank Thomas for your review, please see my responses/comments inline. > > 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init', > > 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init', > > 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init', > > 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init', > > 'sun5i_setup_clockevent', > > 'mt7621_clk_init', > > 'samsung_clk_register_mux', 'samsung_clk_register_gate', > > 'samsung_clk_register_fixed_rate', 'clk_boston_setup', > > 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc', > > 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock', > > 'meson6_timer_init', 'atcpit100_timer_init', > > 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', > > 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init', > > 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set', > > 'a72_b53_rac_enable_all', 'mcp_a72_b53_set', > > 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision', > > 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init', > > 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom', > > 'create_one_cmux', 'create_one_pll', 'p2041_init_periph', > > 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph', > > 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init', > > 'sh73a0_cpg_clocks_init', 'cpg_div6_register', > > 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk', > > 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register', > > 'cpg_sd_clk_register', 'r7s9210_update_clk_table', > > 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init', > > 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register', > > 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init', > > 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup', > > 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup', > > 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init', > > 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init', > > 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init', > > 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init', > > 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init', > > 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init', > > 'of_fixed_mmio_clk_setup', > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one', > > ARM based drivers are initialized on x86 in which way? As I already explained to Michael the reason why ARM code is included into this list is the fact that that smatch cross function database contains all code paths, so when querying up for the possible ones you get everything. I will filter this list to x86 and provide results since this seems to be important. The reason why I don't see this as important is that the threat model we are trying to protect against here (malicious VMM/host) is not TDX specific and I see no reason why in some near or further future ARM arch would also have a confidential cloud computing solution and they would need to do exactly the same thing. > > > 'hv_init_tsc_clocksource', 'hv_init_clocksource', > > HyperV. See XEN > > > 'skx_init', > > 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init', > > 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init', > > 'pnd2_init', 'edac_init', 'adummy_init', > > EDAC has already hypervisor checks > > > 'init_acpi_pm_clocksource', > > Requires ACPI table entry or command line override > > > 'intel_rng_mod_init', > > Has an old style PCI table which is searched via pci_get_device(). Could > do with a cleanup which converts it to proper PCI probing. > > <SNIP> > > So I stop here, because it would be way simpler to have the file names > but so far I could identify all of it from the top of my head. > > So what are you trying to tell me? That you found tons of ioremaps in > __init functions which are completely irrelevant. > > Please stop making arguments based on completely nonsensical data. It > took me less than 5 minutes to eliminate more than 50% of that list and > I'm pretty sure that I could have eliminated the bulk of the rest as > well. > > The fact that a large part of this is ARM only, the fact that nobody > bothered to look at how e.g. IOMMU detection works and whether those > ioremaps actually can't be reached is hillarious. > > So of these 400 instances are at least 30% ARM specific and those > cannot be reached on ARM nilly willy either because they are either > device tree or ACPI enumerated. > > Claiming that it is soo much work to analyze 400 at least to the point: Please bear in mind that the 400 function list is not complete by any means. Many drivers define driver-specific wrappers for low-level IO and then use these wrappers through their code. For the drivers we have audited (like virtIO) we have manually read the code of each driver, identified these wrapper functions (like virtio_cread*) and then added them into the static analyzer to track the all the invocations of these functions. How do you propose to repeat this exercise for all the Linux kernel driver/module codebase? > > - whether they are relevant for x86 and therefore potentially TDX at > all > > - whether they have some form of enumeration or detection which makes > the ioremaps unreachable when the trusted BIOS is implemented > correctly > > Ijust can laugh at that, really: > > Two of my engineers have done an inventory of hundreds of cpu hotplug > notifier instances in a couple of days some years ago. Ditto for a > couple of hundred seqcount and a couple of hundred tasklet usage > sites. > > Sure, but it makes more security handwaving and a nice presentation to > tell people how much unsecure code there is based on half thought out > static analysis. To do a proper static analysis of this, you really > have to do a proper brain based analysis first of: > > 1) Which code is relevant for x86 > > 2) What are the mechanisms which are used across the X86 relevant > driver space to make these ioremap/MSR accesses actually reachable. > > And of course this will not be complete, but this eliminates the vast > majority of your list. And looking at the remaining ones is not rocket > science either. > > I can't take that serious at all. Come back when you have a properly > compiled list of drivers which: > > 1) Can even be built for X86 > > 2) Do ioremap/MSR based poking unconditionally. > > 3) Cannot be easily guarded off at the subsystem level I see two main problems with this approach (in addition to the above fact that we need to spend *a lot of time* building the complete list of such functions first): 1. It is very error prone since it would involve a lot of manual code audit done by humans. And here each mistake is a potential new CVE for the kernel in the scope of confidential computing threat model. 2. It would require a lot of expertise from people who would want to run a secure protected guest kernel to make sure that their particular setup is secure. Essentially they would need to completely repeat the hardening exercise from scratch and verify all the involved code paths to make sure for their build, certain code is indeed disabled, guarded at the subsystem level, not reachable because of cupid bits, etc. etc. Not everyone has resources to do such an analysis (I would say little do), so we will end up with a lot of unsecure production kernels, because time to market pressure would win over the security if doing it means so much work. Speaking in security terms what you propose is to start from day one analyzing the whole existing and waste attack surface, fix all the security issues in it manually one by one and then somewhere in 20 years from now be done with it (or maybe never because there is always new code coming in, and some would introduce new problems). What we are proposing is first to try to minimize the attack surface as much as possible with a simple and well understood set of controls and then spend realistic amount of time securing this minimized surface. Again, this is not TDX specific attack surface, but generic to any guest kernel that wants to be secure under confidential cloud computing threat model. So, it is not us who is pushing smth into the kernel for the sake of TDX, but we seems to be the first ones so far who cares about the whole picture and not just to provide HW means to run a protected guest. And given that most of the drivers have never been written with this confidential cloud computing threat model in mind, it is going to take time to fix all of them. This really should be a community effort and a long-running task. Take a look on this paper for example: https://arxiv.org/pdf/2109.10660.pdf They have tried to fuzz just small set of 22 drivers from this attack surface and found 50 security relevant bugs. And fuzzing is of course no ultimate security testing technique. So, I don't see how without the drastically reducing the scope of security hardening first we can end up with a secure setup. And then as the time goes and more people looking into this attack surface, we can (hopefully) gradually open it up. Best Regards, Elena.
On Sun, Oct 10, 2021 at 03:11:23PM -0700, Andi Kleen wrote: > > On 10/9/2021 1:39 PM, Dan Williams wrote: > > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote: > > > > From: Andi Kleen <ak@linux.intel.com> > > > > > > > > For Confidential VM guests like TDX, the host is untrusted and hence > > > > the devices emulated by the host or any data coming from the host > > > > cannot be trusted. So the drivers that interact with the outside world > > > > have to be hardened by sharing memory with host on need basis > > > > with proper hardening fixes. > > > > > > > > For the PCI driver case, to share the memory with the host add > > > > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs. > > > > > > > > Signed-off-by: Andi Kleen <ak@linux.intel.com> > > > > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> > > > So I proposed to make all pci mappings shared, eliminating the need > > > to patch drivers. > > > > > > To which Andi replied > > > One problem with removing the ioremap opt-in is that > > > it's still possible for drivers to get at devices without going through probe. > > > > > > To which Greg replied: > > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > > > If there are in-kernel PCI drivers that do not do this, they need to be > > > fixed today. > > > > > > Can you guys resolve the differences here? > > I agree with you and Greg here. If a driver is accessing hardware > > resources outside of the bind lifetime of one of the devices it > > supports, and in a way that neither modrobe-policy nor > > device-authorization -policy infrastructure can block, that sounds > > like a bug report. > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and > others) in init functions that also register drivers (thanks Elena for the > number) > > Some are probably old drivers that could be fixed, but it's quite a few > legitimate cases. For example for platform or ISA drivers that's the only > way they can be implemented because they often have no other enumeration > mechanism. For PCI drivers it's rarer, but also still can happen. One > example that comes to mind here is the x86 Intel uncore drivers, which > support a mix of MSR, ioremap and PCI config space accesses all from the > same driver. This particular example can (and should be) fixed in other > ways, but similar things also happen in other drivers, and they're not all > broken. Even for the broken ones they're usually for some crufty old devices > that has very few users, so it's likely untestable in practice. > > My point is just that the ecosystem of devices that Linux supports is messy > enough that there are legitimate exceptions from the "First IO only in probe > call only" rule. No, there should not be for PCI drivers. If there is, that is a bug that you can, and should, fix. > And we can't just fix them all. Even if we could it would be hard to > maintain. Not true at all, you can fix them, and write a simple coccinelle rule to prevent them from ever coming back in. > Using a "firewall model" hooking into a few strategic points like we're > proposing here is much saner for everyone. No it is not. It is "easier" for you because you all do not want to fix up all of the drivers and want to add additional code complexity on top of the current mess that we have and then you can claim that you have "hardened" the drivers you care about. Despite no one ever explaining exactly what "hardened" means to me. Again, fix the existing drivers, you have the whole source, if this is something that you all care about, it should not be hard to do. Stop making excuses. greg k-h
On Tue, Oct 12, 2021 at 11:35:04AM -0700, Andi Kleen wrote: > > I'd rather see more concerted efforts focused/limited core changes > > rather than leaf driver changes until there is a clearer definition of > > hardened. > > A hardened driver is a driver that Ah, you do define this, thank you! > - Had similar security (not API) oriented review of its IO operations > (mainly MMIO access, but also PCI config space) as a non privileged user > interface (like a ioctl). That review should be focused on memory safety. Where is this review done? Where is is documented? Who is responsible for keeping it up to date with every code change to the driver, and to the code that the driver calls and the code that calls the driver? > - Had some fuzzing on these IO interfaces using to be released tools. "some"? What tools? What is the input, and where is that defined? How much fuzzing do you claim is "good enough"? > Right now it's only three virtio drivers (console, net, block) Where was this work done and published? And why only 3? > Really it's no different than what we do for every new unprivileged user > interface. Really? I have seen loads of new drivers from Intel submitted in the past months that would fail any of the above things just based on obvious code reviews that I end up having to do... If you want to start a "hardened driver" effort, there's a lot of real work that needs to be done here and documented, and explained why it can not just be done for the whole kernel... greg k-h
On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote: > On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote: > > > > > To which Andi replied > > > One problem with removing the ioremap opt-in is that > > > it's still possible for drivers to get at devices without going through probe. > > > > > > To which Greg replied: > > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > > > If there are in-kernel PCI drivers that do not do this, they need to be > > > fixed today. > > > > > > Can you guys resolve the differences here? > > > > > > I addressed this in my other mail, but we may need more discussion. > > Hopefully Greg will reply to that one. Note, when wanting Greg to reply, someone should at the very least cc: him. {sigh} greg k-h
On Mon, Oct 18, 2021 at 02:15:47PM +0200, Greg KH wrote: > On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote: > > On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote: > > > > > > > To which Andi replied > > > > One problem with removing the ioremap opt-in is that > > > > it's still possible for drivers to get at devices without going through probe. > > > > > > > > To which Greg replied: > > > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/ > > > > If there are in-kernel PCI drivers that do not do this, they need to be > > > > fixed today. > > > > > > > > Can you guys resolve the differences here? > > > > > > > > > I addressed this in my other mail, but we may need more discussion. > > > > Hopefully Greg will reply to that one. > > Note, when wanting Greg to reply, someone should at the very least cc: > him. "that one" being "Andi's other mail". Which I don't remember what it was, by now. Sorry. > {sigh} > > greg k-h
diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h index df636c6d8e6c..a4a83c8ab3cf 100644 --- a/include/asm-generic/pci_iomap.h +++ b/include/asm-generic/pci_iomap.h @@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar, unsigned long offset, unsigned long maxlen); +extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar, + unsigned long max); +extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar, + unsigned long offset, + unsigned long maxlen); + /* Create a virtual mapping cookie for a port on a given PCI device. * Do not call this directly, it exists to make it easier for architectures * to override */ diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c index 57bd92f599ee..2816dc8715da 100644 --- a/lib/pci_iomap.c +++ b/lib/pci_iomap.c @@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size) return ioremap_wc(addr, size); } +static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size) +{ + return ioremap_host_shared(addr, size); +} + static void __iomem *pci_iomap_range_map(struct pci_dev *dev, int bar, unsigned long offset, @@ -106,6 +111,48 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev, } EXPORT_SYMBOL_GPL(pci_iomap_wc_range); +/** + * pci_iomap_host_shared_range - create a virtual shared mapping cookie + * for a PCI BAR + * @dev: PCI device that owns the BAR + * @bar: BAR number + * @offset: map memory at the given offset in BAR + * @maxlen: max length of the memory to map + * + * Remap a pci device's resources shared in a confidential guest. + * For more details see pci_iomap_range's documentation. + * + * @maxlen specifies the maximum length to map. To get access to + * the complete BAR from offset to the end, pass %0 here. + */ +void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar, + unsigned long offset, + unsigned long maxlen) +{ + return pci_iomap_range_map(dev, bar, offset, maxlen, + map_ioremap_host_shared, true); +} +EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range); + +/** + * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI BAR + * @dev: PCI device that owns the BAR + * @bar: BAR number + * @maxlen: length of the memory to map + * + * See pci_iomap for details. This function creates a shared mapping + * with the host for confidential hosts. + * + * @maxlen specifies the maximum length to map. To get access to the + * complete BAR without checking for its length first, pass %0 here. + */ +void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar, + unsigned long maxlen) +{ + return pci_iomap_host_shared_range(dev, bar, 0, maxlen); +} +EXPORT_SYMBOL_GPL(pci_iomap_host_shared); + /** * pci_iomap - create a virtual mapping cookie for a PCI BAR * @dev: PCI device that owns the BAR