diff mbox series

[v5,12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

Message ID 20211009003711.1390019-13-sathyanarayanan.kuppuswamy@linux.intel.com (mailing list archive)
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series Add TDX Guest Support (shared-mm support) | expand

Commit Message

Kuppuswamy Sathyanarayanan Oct. 9, 2021, 12:37 a.m. UTC
From: Andi Kleen <ak@linux.intel.com>

For Confidential VM guests like TDX, the host is untrusted and hence
the devices emulated by the host or any data coming from the host
cannot be trusted. So the drivers that interact with the outside world
have to be hardened by sharing memory with host on need basis
with proper hardening fixes.

For the PCI driver case, to share the memory with the host add
pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 Changes since v4:
 * Replaced "_shared" with "_host_shared" in pci_iomap* APIs
 * Fixed commit log as per review comments.

 include/asm-generic/pci_iomap.h |  6 +++++
 lib/pci_iomap.c                 | 47 +++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

Comments

Michael S. Tsirkin Oct. 9, 2021, 9:53 a.m. UTC | #1
On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> For Confidential VM guests like TDX, the host is untrusted and hence
> the devices emulated by the host or any data coming from the host
> cannot be trusted. So the drivers that interact with the outside world
> have to be hardened by sharing memory with host on need basis
> with proper hardening fixes.
> 
> For the PCI driver case, to share the memory with the host add
> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

So I proposed to make all pci mappings shared, eliminating the need
to patch drivers.

To which Andi replied
	One problem with removing the ioremap opt-in is that
	it's still possible for drivers to get at devices without going through probe.

To which Greg replied:
https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
	If there are in-kernel PCI drivers that do not do this, they need to be
	fixed today.

Can you guys resolve the differences here?

And once they are resolved, mention this in the commit log so
I don't get to re-read the series just to find out nothing
changed in this respect?

I frankly do not believe we are anywhere near being able to harden
an arbitrary kernel config against attack.
How about creating a defconfig that makes sense for TDX then?
Anyone deviating from that better know what they are doing,
this API tweaking is just putting policy into the kernel  ...

> ---
>  Changes since v4:
>  * Replaced "_shared" with "_host_shared" in pci_iomap* APIs
>  * Fixed commit log as per review comments.
> 
>  include/asm-generic/pci_iomap.h |  6 +++++
>  lib/pci_iomap.c                 | 47 +++++++++++++++++++++++++++++++++
>  2 files changed, 53 insertions(+)
> 
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index df636c6d8e6c..a4a83c8ab3cf 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>  extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
>  					unsigned long offset,
>  					unsigned long maxlen);
> +extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
> +					   unsigned long max);
> +extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
> +						 unsigned long offset,
> +						 unsigned long maxlen);
> +
>  /* Create a virtual mapping cookie for a port on a given PCI device.
>   * Do not call this directly, it exists to make it easier for architectures
>   * to override */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index 57bd92f599ee..2816dc8715da 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
>  	return ioremap_wc(addr, size);
>  }
>  
> +static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size)
> +{
> +	return ioremap_host_shared(addr, size);
> +}
> +
>  static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
>  					 int bar,
>  					 unsigned long offset,
> @@ -106,6 +111,48 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
>  
> +/**
> + * pci_iomap_host_shared_range - create a virtual shared mapping cookie
> + *				 for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Remap a pci device's resources shared in a confidential guest.
> + * For more details see pci_iomap_range's documentation.

So how does a driver author know when to use this function, and when to
use the regular pci_iomap_range?  Drivers have no idea whether they are
used in a confidential guest, and which ranges are shared, it's a TDX
thing ...

This documentation should really address it.

> + *
> + * @maxlen specifies the maximum length to map. To get access to
> + * the complete BAR from offset to the end, pass %0 here.
> + */
> +void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
> +					  unsigned long offset,
> +					  unsigned long maxlen)
> +{
> +	return pci_iomap_range_map(dev, bar, offset, maxlen,
> +				   map_ioremap_host_shared, true);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range);
> +
> +/**
> + * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @maxlen: length of the memory to map
> + *
> + * See pci_iomap for details. This function creates a shared mapping
> + * with the host for confidential hosts.
> + *
> + * @maxlen specifies the maximum length to map. To get access to the
> + * complete BAR without checking for its length first, pass %0 here.
> + */
> +void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
> +			       unsigned long maxlen)
> +{
> +	return pci_iomap_host_shared_range(dev, bar, 0, maxlen);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_host_shared);
> +
>  /**
>   * pci_iomap - create a virtual mapping cookie for a PCI BAR
>   * @dev: PCI device that owns the BAR
> -- 
> 2.25.1
Dan Williams Oct. 9, 2021, 8:39 p.m. UTC | #2
On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> >
> > For Confidential VM guests like TDX, the host is untrusted and hence
> > the devices emulated by the host or any data coming from the host
> > cannot be trusted. So the drivers that interact with the outside world
> > have to be hardened by sharing memory with host on need basis
> > with proper hardening fixes.
> >
> > For the PCI driver case, to share the memory with the host add
> > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
> So I proposed to make all pci mappings shared, eliminating the need
> to patch drivers.
>
> To which Andi replied
>         One problem with removing the ioremap opt-in is that
>         it's still possible for drivers to get at devices without going through probe.
>
> To which Greg replied:
> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
>         If there are in-kernel PCI drivers that do not do this, they need to be
>         fixed today.
>
> Can you guys resolve the differences here?

I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modrobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report. Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings. Let the new
device-authorization mechanism (with policy in userspace) be the
central place where all of these driver "trust" issues are managed.

> And once they are resolved, mention this in the commit log so
> I don't get to re-read the series just to find out nothing
> changed in this respect?
>
> I frankly do not believe we are anywhere near being able to harden
> an arbitrary kernel config against attack.
> How about creating a defconfig that makes sense for TDX then?
> Anyone deviating from that better know what they are doing,
> this API tweaking is just putting policy into the kernel  ...

Right, userspace authorization policy and select driver fixups seems
to be the answer to the raised concerns.
Andi Kleen Oct. 10, 2021, 10:11 p.m. UTC | #3
On 10/9/2021 1:39 PM, Dan Williams wrote:
> On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
>>> From: Andi Kleen <ak@linux.intel.com>
>>>
>>> For Confidential VM guests like TDX, the host is untrusted and hence
>>> the devices emulated by the host or any data coming from the host
>>> cannot be trusted. So the drivers that interact with the outside world
>>> have to be hardened by sharing memory with host on need basis
>>> with proper hardening fixes.
>>>
>>> For the PCI driver case, to share the memory with the host add
>>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
>>>
>>> Signed-off-by: Andi Kleen <ak@linux.intel.com>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> So I proposed to make all pci mappings shared, eliminating the need
>> to patch drivers.
>>
>> To which Andi replied
>>          One problem with removing the ioremap opt-in is that
>>          it's still possible for drivers to get at devices without going through probe.
>>
>> To which Greg replied:
>> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
>>          If there are in-kernel PCI drivers that do not do this, they need to be
>>          fixed today.
>>
>> Can you guys resolve the differences here?
> I agree with you and Greg here. If a driver is accessing hardware
> resources outside of the bind lifetime of one of the devices it
> supports, and in a way that neither modrobe-policy nor
> device-authorization -policy infrastructure can block, that sounds
> like a bug report.

The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
others) in init functions that also register drivers (thanks Elena for 
the number)

Some are probably old drivers that could be fixed, but it's quite a few 
legitimate cases. For example for platform or ISA drivers that's the 
only way they can be implemented because they often have no other 
enumeration mechanism. For PCI drivers it's rarer, but also still can 
happen. One example that comes to mind here is the x86 Intel uncore 
drivers, which support a mix of MSR, ioremap and PCI config space 
accesses all from the same driver. This particular example can (and 
should be) fixed in other ways, but similar things also happen in other 
drivers, and they're not all broken. Even for the broken ones they're 
usually for some crufty old devices that has very few users, so it's 
likely untestable in practice.

My point is just that the ecosystem of devices that Linux supports is 
messy enough that there are legitimate exceptions from the "First IO 
only in probe call only" rule.

And we can't just fix them all. Even if we could it would be hard to 
maintain.

Using a "firewall model" hooking into a few strategic points like we're 
proposing here is much saner for everyone.

Now we can argue about the details. Right now what we're proposing has 
some redundancies: it has both a device model filter and low level 
filter for ioremap (this patch and some others). The low level filter is 
for catching issues that don't clearly fit into the 
"enumeration<->probe" model. You could call that redundant, but I would 
call it defense in depth or better safe than sorry. In theory it would 
be enough to have the low level opt-in only, but that would have the 
drawback that is something gets enumerated after all you would have all 
kind of weird device driver failures and in some cases even killed 
guests. So I think it makes sense to have


> Fix those drivers instead of sprinkling
> ioremap_shared in select places and with unclear rules about when a
> driver is allowed to do "shared" mappings.

Only add it when the driver has been audited and hardened.

But I agree we need on a documented process for this. I will work on 
some documentation for a proposal. But essentially I think it should be 
some variant of what Elena has outlined in her talk at Security Summit.

https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf

That is using extra auditing/scrutiny at review time, supported with 
some static code analysis that points to the interaction points, and 
code needs to be fuzzed explicitly.

However short term it's only three virtio drivers, so this is not a 
urgent problem.

> Let the new
> device-authorization mechanism (with policy in userspace)


Default policy in user space just seems to be a bad idea here. Who 
should know if a driver is hardened other than the kernel? Maintaining 
the list somewhere else just doesn't make sense to me.

Also there is the more practical problem that some devices are needed 
for booting. For example in TDX we can't print something to the console 
with this mechanism, so you would never get any output before the 
initrd. Just seems like a nightmare for debugging anything. There really 
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to 
have a sane kernel default that works early.

-Andi
Andi Kleen Oct. 10, 2021, 10:22 p.m. UTC | #4
> To which Andi replied
> 	One problem with removing the ioremap opt-in is that
> 	it's still possible for drivers to get at devices without going through probe.
>
> To which Greg replied:
> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
> 	If there are in-kernel PCI drivers that do not do this, they need to be
> 	fixed today.
>
> Can you guys resolve the differences here?


I addressed this in my other mail, but we may need more discussion.


>
> And once they are resolved, mention this in the commit log so
> I don't get to re-read the series just to find out nothing
> changed in this respect?
>
> I frankly do not believe we are anywhere near being able to harden
> an arbitrary kernel config against attack.

Why not? Device filter and the opt-ins together are a fairly strong 
mechanism.

And it's not that they're a lot of code or super complicated either.

You're essentially objecting to a single line change in your subsystem here.


> How about creating a defconfig that makes sense for TDX then?

TDX can be used in many different ways, I don't think a defconfig is 
practical.

In theory you could do some Kconfig dependency (at the pain point of 
having separate kernel binariees), but why not just do it at run time 
then if you maintain the list anyways. That's much easier and saner for 
everyone. In the past we usually always ended up with runtime mechanism 
for similar things anyways.

Also it turns out that the filter mechanisms are needed for some arch 
drivers which are not even configurable, so alone it's probably not enough,


> Anyone deviating from that better know what they are doing,
> this API tweaking is just putting policy into the kernel  ...

Hardening drivers is kernel policy. It cannot be done anywhere else.


-Andi
Christoph Hellwig Oct. 11, 2021, 7:58 a.m. UTC | #5
Just as last time:  This does not make any sense.  ioremap is shared
by definition.
Michael S. Tsirkin Oct. 11, 2021, 11:59 a.m. UTC | #6
On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> 
> > To which Andi replied
> > 	One problem with removing the ioremap opt-in is that
> > 	it's still possible for drivers to get at devices without going through probe.
> > 
> > To which Greg replied:
> > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
> > 	If there are in-kernel PCI drivers that do not do this, they need to be
> > 	fixed today.
> > 
> > Can you guys resolve the differences here?
> 
> 
> I addressed this in my other mail, but we may need more discussion.

Hopefully Greg will reply to that one.

> 
> > 
> > And once they are resolved, mention this in the commit log so
> > I don't get to re-read the series just to find out nothing
> > changed in this respect?
> > 
> > I frankly do not believe we are anywhere near being able to harden
> > an arbitrary kernel config against attack.
> 
> Why not? Device filter and the opt-ins together are a fairly strong
> mechanism.

Because it does not end with I/O operations, that's a trivial example.
module unloading is famous for being racy: I just re-read that part of
virtio drivers and sure enough we have bugs there, this is after
they have presumably been audited, so a TDX guest is better off
just disabling hot-unplug completely, and hotplug isn't far behind.
Malicious filesystems can exploit many linux systems unless
you take pains to limit what is mounted and how.
Networking devices tend to get into the default namespaces and can
do more or less whatever CAP_NET_ADMIN can.
Etc.
I am not saying this makes the effort worthless, I am saying userspace
better know very well what it's doing, and kernel better be
configured in a very specific way.

> And it's not that they're a lot of code or super complicated either.
> 
> You're essentially objecting to a single line change in your subsystem here.

Well I commented on the API patch, not the virtio patch.
If it's a way for a driver to say "I am hardened
and audited" then I guess it should at least say so. It has nothing
to do with host or sharing, that's an implementation detail,
and it obscures the actual limitations of the approach,
in that eventually in an ideal world all drivers would be secure
and use this API.

Yes, if that's the API that PCI gains then virtio will use it.


> > How about creating a defconfig that makes sense for TDX then?
> 
> TDX can be used in many different ways, I don't think a defconfig is
> practical.
> 
> In theory you could do some Kconfig dependency (at the pain point of having
> separate kernel binariees), but why not just do it at run time then if you
> maintain the list anyways. That's much easier and saner for everyone. In the
> past we usually always ended up with runtime mechanism for similar things
> anyways.
> 
> Also it turns out that the filter mechanisms are needed for some arch
> drivers which are not even configurable, so alone it's probably not enough,


I guess they aren't really needed though right, or you won't try to
filter them? So make them configurable?

> 
> > Anyone deviating from that better know what they are doing,
> > this API tweaking is just putting policy into the kernel  ...
> 
> Hardening drivers is kernel policy. It cannot be done anywhere else.
> 
> 
> -Andi

To clarify, the policy is which drivers to load into the kernel.
Andi Kleen Oct. 11, 2021, 5:23 p.m. UTC | #7
On 10/11/2021 12:58 AM, Christoph Hellwig wrote:
> Just as last time:  This does not make any sense.  ioremap is shared
> by definition.

It's not necessarily shared with the host for confidential computing: 
for example BIOS mappings definitely should not be shared, but they're 
using ioremap today.

But if you have a better term please propose something. I tried to 
clarify it with "shared_host", but I don't know a better term.


-Andi
Andi Kleen Oct. 11, 2021, 5:32 p.m. UTC | #8
> Because it does not end with I/O operations, that's a trivial example.
> module unloading is famous for being racy: I just re-read that part of
> virtio drivers and sure enough we have bugs there, this is after
> they have presumably been audited, so a TDX guest is better off
> just disabling hot-unplug completely, and hotplug isn't far behind.

These all shouldn't matter for a confidential guest. The only way it can 
be attacked is through IO, everything else is protected by hardware.


Also it would all require doing something at the guest level, which we 
assume is not malicious.


> Malicious filesystems can exploit many linux systems unless
> you take pains to limit what is mounted and how.

That's expected to be handled by authenticated dmcrypt and similar. 
Hardening at this level has been done for many years.


> Networking devices tend to get into the default namespaces and can
> do more or less whatever CAP_NET_ADMIN can.
> Etc.


Networking should be already hardened, otherwise you would have much 
worse problems today.



> hange in your subsystem here.
> Well I commented on the API patch, not the virtio patch.
> If it's a way for a driver to say "I am hardened
> and audited" then I guess it should at least say so.


This is handled by the central allow list. We intentionally didn't want 
each driver to declare itself, but have a central list where changes 
will get more scrutiny than random driver code.

But then there are the additional opt-ins for the low level firewall. 
These are in the API. I don't see how it could be done at the driver 
level, unless you want to pass in a struct device everywhere?

>>> How about creating a defconfig that makes sense for TDX then?
>> TDX can be used in many different ways, I don't think a defconfig is
>> practical.
>>
>> In theory you could do some Kconfig dependency (at the pain point of having
>> separate kernel binariees), but why not just do it at run time then if you
>> maintain the list anyways. That's much easier and saner for everyone. In the
>> past we usually always ended up with runtime mechanism for similar things
>> anyways.
>>
>> Also it turns out that the filter mechanisms are needed for some arch
>> drivers which are not even configurable, so alone it's probably not enough,
>
> I guess they aren't really needed though right, or you won't try to
> filter them?

We're addressing most of them with the device filter for platform 
drivers. But since we cannot stop them doing ioremap IO in their init 
code they also need the low level firewall.

Some others that cannot be addressed have explicit disables.


> So make them configurable?

Why not just fix the runtime? It's much saner for everyone. Proposing to 
do things at build time sounds like we're in Linux 0.99 days.

-Andi
Michael S. Tsirkin Oct. 11, 2021, 6:22 p.m. UTC | #9
On Mon, Oct 11, 2021 at 10:32:23AM -0700, Andi Kleen wrote:
> 
> > Because it does not end with I/O operations, that's a trivial example.
> > module unloading is famous for being racy: I just re-read that part of
> > virtio drivers and sure enough we have bugs there, this is after
> > they have presumably been audited, so a TDX guest is better off
> > just disabling hot-unplug completely, and hotplug isn't far behind.
> 
> These all shouldn't matter for a confidential guest. The only way it can be
> attacked is through IO, everything else is protected by hardware.
> 
> 
> Also it would all require doing something at the guest level, which we
> assume is not malicious.
> 
> 
> > Malicious filesystems can exploit many linux systems unless
> > you take pains to limit what is mounted and how.
> 
> That's expected to be handled by authenticated dmcrypt and similar.
> Hardening at this level has been done for many years.

It's possible to do it like this, sure. But that's not the
only configuration, userspace needs to be smart about setting things up.
Which is my point really.

> 
> > Networking devices tend to get into the default namespaces and can
> > do more or less whatever CAP_NET_ADMIN can.
> > Etc.
> 
> 
> Networking should be already hardened, otherwise you would have much worse
> problems today.

Same thing. NFS is pretty common, you are saying don't do it then. Fair
enough but again, arbitrary configs just aren't going to be secure.

> 
> 
> > hange in your subsystem here.
> > Well I commented on the API patch, not the virtio patch.
> > If it's a way for a driver to say "I am hardened
> > and audited" then I guess it should at least say so.
> 
> 
> This is handled by the central allow list. We intentionally didn't want each
> driver to declare itself, but have a central list where changes will get
> more scrutiny than random driver code.

Makes sense. Additionally, distros can tweak that to their heart's
content, selecting the functionality/security balance that makes sense
for them.

> But then there are the additional opt-ins for the low level firewall. These
> are in the API. I don't see how it could be done at the driver level, unless
> you want to pass in a struct device everywhere?

I am just saying don't do it then. Don't build drivers that distro does
not want to support into kernel. And don't load them when they are
modules.

> > > > How about creating a defconfig that makes sense for TDX then?
> > > TDX can be used in many different ways, I don't think a defconfig is
> > > practical.
> > > 
> > > In theory you could do some Kconfig dependency (at the pain point of having
> > > separate kernel binariees), but why not just do it at run time then if you
> > > maintain the list anyways. That's much easier and saner for everyone. In the
> > > past we usually always ended up with runtime mechanism for similar things
> > > anyways.
> > > 
> > > Also it turns out that the filter mechanisms are needed for some arch
> > > drivers which are not even configurable, so alone it's probably not enough,
> > 
> > I guess they aren't really needed though right, or you won't try to
> > filter them?
> 
> We're addressing most of them with the device filter for platform drivers.
> But since we cannot stop them doing ioremap IO in their init code they also
> need the low level firewall.
> 
> Some others that cannot be addressed have explicit disables.
> 
> 
> > So make them configurable?
> 
> Why not just fix the runtime? It's much saner for everyone. Proposing to do
> things at build time sounds like we're in Linux 0.99 days.
> 
> -Andi

Um. Tweaking driver code is not just build time, it's development time.
At least with kconfig you don't need to patch your kernel.
Michael S. Tsirkin Oct. 11, 2021, 7:09 p.m. UTC | #10
On Mon, Oct 11, 2021 at 10:23:00AM -0700, Andi Kleen wrote:
> 
> On 10/11/2021 12:58 AM, Christoph Hellwig wrote:
> > Just as last time:  This does not make any sense.  ioremap is shared
> > by definition.
> 
> It's not necessarily shared with the host for confidential computing: for
> example BIOS mappings definitely should not be shared, but they're using
> ioremap today.

That just needs to be fixed.

> But if you have a better term please propose something. I tried to clarify
> it with "shared_host", but I don't know a better term.
> 
> 
> -Andi
> 


The reason we have trouble is that it's not clear what does the API mean
outside the realm of TDX.
If we really, truly want an API that says "ioremap and it's a hardened
driver" then I guess ioremap_hardened_driver is what you want.
Christoph Hellwig Oct. 12, 2021, 5:31 a.m. UTC | #11
On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:
> The reason we have trouble is that it's not clear what does the API mean
> outside the realm of TDX.
> If we really, truly want an API that says "ioremap and it's a hardened
> driver" then I guess ioremap_hardened_driver is what you want.

Yes.  And why would be we ioremap the BIOS anyway?  It is not I/O memory
in any of the senses we generally use ioremap for.
Dan Williams Oct. 12, 2021, 5:42 p.m. UTC | #12
On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> From: Andi Kleen <ak@linux.intel.com>
> >>>
> >>> For Confidential VM guests like TDX, the host is untrusted and hence
> >>> the devices emulated by the host or any data coming from the host
> >>> cannot be trusted. So the drivers that interact with the outside world
> >>> have to be hardened by sharing memory with host on need basis
> >>> with proper hardening fixes.
> >>>
> >>> For the PCI driver case, to share the memory with the host add
> >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >>>
> >>> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> >>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> So I proposed to make all pci mappings shared, eliminating the need
> >> to patch drivers.
> >>
> >> To which Andi replied
> >>          One problem with removing the ioremap opt-in is that
> >>          it's still possible for drivers to get at devices without going through probe.
> >>
> >> To which Greg replied:
> >> https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
> >>          If there are in-kernel PCI drivers that do not do this, they need to be
> >>          fixed today.
> >>
> >> Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modrobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)
>
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the
> only way they can be implemented because they often have no other
> enumeration mechanism. For PCI drivers it's rarer, but also still can
> happen. One example that comes to mind here is the x86 Intel uncore
> drivers, which support a mix of MSR, ioremap and PCI config space
> accesses all from the same driver. This particular example can (and
> should be) fixed in other ways, but similar things also happen in other
> drivers, and they're not all broken. Even for the broken ones they're
> usually for some crufty old devices that has very few users, so it's
> likely untestable in practice.
>
> My point is just that the ecosystem of devices that Linux supports is
> messy enough that there are legitimate exceptions from the "First IO
> only in probe call only" rule.
>
> And we can't just fix them all. Even if we could it would be hard to
> maintain.
>
> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.
>
> Now we can argue about the details. Right now what we're proposing has
> some redundancies: it has both a device model filter and low level
> filter for ioremap (this patch and some others). The low level filter is
> for catching issues that don't clearly fit into the
> "enumeration<->probe" model. You could call that redundant, but I would
> call it defense in depth or better safe than sorry. In theory it would
> be enough to have the low level opt-in only, but that would have the
> drawback that is something gets enumerated after all you would have all
> kind of weird device driver failures and in some cases even killed
> guests. So I think it makes sense to have

The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe". In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.

I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened. I.e. instead of jumping to the assertion that fixing up
these init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths are built-in to the
average kernel config? A strawman might be to add a sprinkling error
exits in the module_init() of the problematic drivers, and only fail
if the module is built-in, and let modprobe policy handle the rest.

>
>
> > Fix those drivers instead of sprinkling
> > ioremap_shared in select places and with unclear rules about when a
> > driver is allowed to do "shared" mappings.
>
> Only add it when the driver has been audited and hardened.
>
> But I agree we need on a documented process for this. I will work on
> some documentation for a proposal. But essentially I think it should be
> some variant of what Elena has outlined in her talk at Security Summit.
>
> https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf
>
> That is using extra auditing/scrutiny at review time, supported with
> some static code analysis that points to the interaction points, and
> code needs to be fuzzed explicitly.
>
> However short term it's only three virtio drivers, so this is not a
> urgent problem.
>
> > Let the new
> > device-authorization mechanism (with policy in userspace)
>
>
> Default policy in user space just seems to be a bad idea here. Who
> should know if a driver is hardened other than the kernel? Maintaining
> the list somewhere else just doesn't make sense to me.

I do not understand the maintenance burden correlation of where the
policy is driven vs where the list is maintained? Even if I agreed
with the contention that out-of-tree userspace would have a hard time
tracking the "hardened" driver list there is still an in-tree
userspace path to explore. E.g. perf maintains lists of things tightly
coupled to the kernel, this authorized device list seems to be in the
same category of data.

> Also there is the more practical problem that some devices are needed
> for booting. For example in TDX we can't print something to the console
> with this mechanism, so you would never get any output before the
> initrd. Just seems like a nightmare for debugging anything. There really
> needs to be an authorization mechanism that works reasonably early.
>
> I can see a point of having user space overrides though, but we need to
> have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug initramfs and then that can authorize the remainder.

[1]: https://lore.kernel.org/all/CAPcyv4im4Tsj1SnxSWe=cAHBP1mQ=zgO-D81n2BpD+_HkpitbQ@mail.gmail.com/
Andi Kleen Oct. 12, 2021, 6:35 p.m. UTC | #13
> The "better safe-than-sorry" argument is hard to build consensus
> around. The spectre mitigations ran into similar problems where the
> community rightly wanted to see the details and instrument the
> problematic paths rather than blanket sprinkle lfence "just to be
> safe".

But that was due to performance problems in hot paths. Nothing of this 
applies here.

> In this case the rules about when a driver is suitably
> "hardened" are vague and the overlapping policy engines are confusing.

What is confusing exactly?

For me it both seems very straight forward and simple (but then I'm biased)

The policy is:

- Have an allow list at driver registration.

- Have an additional opt-in for MMIO mappings (and potentially config 
space, but that's not currently there) to cover init calls completely.

>
> I'd rather see more concerted efforts focused/limited core changes
> rather than leaf driver changes until there is a clearer definition of
> hardened.

A hardened driver is a driver that

- Had similar security (not API) oriented review of its IO operations 
(mainly MMIO access, but also PCI config space) as a non privileged user 
interface (like a ioctl). That review should be focused on memory safety.

- Had some fuzzing on these IO interfaces using to be released tools.

Right now it's only three virtio drivers (console, net, block)

Really it's no different than what we do for every new unprivileged user 
interface.


> I.e. instead of jumping to the assertion that fixing up
> these init-path vulnerabilities are too big to fix, dig to the next
> level to provide more evidence that per-driver opt-in is the only
> viable option.
>
> For example, how many of these problematic paths are built-in to the
> average kernel config?

I don't think arguments from "the average kernel config" (if such a 
thing even exists) are useful. That's would be just hand waving.


> A strawman might be to add a sprinkling error
> exits in the module_init() of the problematic drivers, and only fail
> if the module is built-in, and let modprobe policy handle the rest.


That would be already hundreds of changes. I have no idea how such a 
thing could be maintained or sustained either.

Really I don't even see how these alternatives can be considered. Tree 
sweeps should always be last resort. They're a pain for everyone. But 
here they're casually thrown around as alternatives to straight forward 
one or two line changes.




>
>> Default policy in user space just seems to be a bad idea here. Who
>> should know if a driver is hardened other than the kernel? Maintaining
>> the list somewhere else just doesn't make sense to me.
> I do not understand the maintenance burden correlation of where the
> policy is driven vs where the list is maintained?

All the hardening and auditing happens in the kernel tree. So it seems 
the natural place to store the result is in the kernel tree.

But there's no single package for initrd, so you would need custom 
configurations for all the supported distros.

Also we're really arguing about a list that currently has three entries.


>   Even if I agreed
> with the contention that out-of-tree userspace would have a hard time
> tracking the "hardened" driver list there is still an in-tree
> userspace path to explore. E.g. perf maintains lists of things tightly
> coupled to the kernel, this authorized device list seems to be in the
> same category of data.

You mean the event list? perf is in the kernel tree, so it's maintained 
together with the kernel.

But we don't have a kernel initrd.



>
>> Also there is the more practical problem that some devices are needed
>> for booting. For example in TDX we can't print something to the console
>> with this mechanism, so you would never get any output before the
>> initrd. Just seems like a nightmare for debugging anything. There really
>> needs to be an authorization mechanism that works reasonably early.
>>
>> I can see a point of having user space overrides though, but we need to
>> have a sane kernel default that works early.
> Right, as I suggested [1], just enough early authorization to
> bootstrap/debug initramfs and then that can authorize the remainder.

But how do you debug the kernel then? Making early undebuggable seems 
just bad policy to me.

And if you fix if for the console why not add the two more entries for 
virtio net and block too?


-Andi
Reshetova, Elena Oct. 12, 2021, 6:36 p.m. UTC | #14
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)

To provide more numbers on this. What I can see so far from a smatch-based
analysis, we have 409 __init style functions (.probe & builtin/module_
_platform_driver_probe excluded) for 5.15 with allyesconfig. The number of
distinct individual IO reads (MSRs included) is much higher than 2.4k and on the
range of 30k because quite often there are more than a single IO read in the
same source function. The full list of accesses and the possible call paths is huge
for manually looking at, but here is the list of the 409 functions if anyone wants
to take a look:

['doc200x_ident_chip',
'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
'late_iommu_features_init', 'detect_ivrs',
'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
'intel_cleanup_irq_remapping', 'detect_intel_iommu',
'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init',
'fb_console_init', 'xenbus_probe_backend_init',
'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init',
'check_tylersburg_isoch', 'dmar_table_init',
'enable_drhd_fault_handling', 'init_acpi_pm_clocksource',
'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init',
'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init',
'samsung_clk_register_mux', 'samsung_clk_register_gate',
'samsung_clk_register_fixed_rate', 'clk_boston_setup',
'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
'meson6_timer_init', 'atcpit100_timer_init',
'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init',
'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init',
'bxt_idle_state_table_update', 'sklh_idle_state_table_update',
'skx_idle_state_table_update',
'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe',
'com90io_probe', 'com90xx_probe', 'pcnet32_init_module',
'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find',
'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find',
'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect',
'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register',
'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register',
'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init',
'init_control', 'con_init', 'cr_pll_init',
'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init',
'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init',
'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init',
'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe',
'timer_probe', 'of_clk_init', '__reserved_mem_init_node',
'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip',
'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init',
'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma',
'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found',
'intel_soc_thermal_init', 'thermal_register_governors',
'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init',
'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup',
'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init',
'early_console_setup', 'init_port', 'early_serial8250_setup',
'linflex_console_setup', 'pl010_console_setup', 'register_earlycon',
'of_setup_earlycon', 'slgt_init', 'moxa_init',
'parport_pc_init_superio', 'parport_pc_find_ports', 'mousedev_init',
'ses_init', 'riocm_init', 'efi_rci2_sysfs_init', 'blogic_probe',
'blogic_init', 'blogic_init_mm_probeinfo',
'blogic_init_probeinfo_list', 'blogic_checkadapter',
'blogic_rdconfig', 'blogic_inquiry', 'adpt_init',
'clk_unprepare_unused_subtree', 'aspeed_socinfo_init',
'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
'sh73a0_cpg_clocks_init', 'cpg_div6_register',
'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
'cpg_sd_clk_register', 'r7s9210_update_clk_table',
'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp',
'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup',
'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp',
'nvidia_set_debug_port', 'detect_set_debug_port',
'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init',
'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe',
'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init',
'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit',
'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init',
'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init',
'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect',
'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init',
'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init',
'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init',
'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller',
'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus',
'devlink_class_init', 'switchtec_ntb_init', 'mport_init',
'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb',
'probe_chip_type', 'legacy_check_special_cases',
'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init',
'pcistub_init_devices_late', 'bcma_host_soc_register',
'bcma_bus_early_register', 'vga_arb_device_init',
'vga_arb_select_default_device', 'zf_init',
'watchdog_deferred_registration', 'wb_smsc_wdt_init',
'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init',
'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe',
'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init',
'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data',
'dmi_scan_machine', 'pci_assign_unassigned_resources',
'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc',
'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc',
'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate',
'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks',
'intel_uncore_init', 'qedr_init_module', 'isapnp_peek',
'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list',
'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract',
'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe',
'intel_opregion_present', 'extlog_init', 'intel_pstate_init',
'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc',
'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct',
'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init',
'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init',
'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi',
'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag',
'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init',
'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init',
'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect',
'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init',
'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe',
'intel_pstate_msrs_not_valid',
'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init',
'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init',
'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init',
'rio_basic_attach]
Andi Kleen Oct. 12, 2021, 6:37 p.m. UTC | #15
On 10/11/2021 10:31 PM, Christoph Hellwig wrote:
> On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:
>> The reason we have trouble is that it's not clear what does the API mean
>> outside the realm of TDX.
>> If we really, truly want an API that says "ioremap and it's a hardened
>> driver" then I guess ioremap_hardened_driver is what you want.
> Yes.  And why would be we ioremap the BIOS anyway?  It is not I/O memory
> in any of the senses we generally use ioremap for.

I/O memory is anything outside the kernel memory map.

-Andi
Andi Kleen Oct. 12, 2021, 6:38 p.m. UTC | #16
On 10/12/2021 11:36 AM, Reshetova, Elena wrote:
>> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
>> others) in init functions that also register drivers (thanks Elena for
>> the number)
> To provide more numbers on this. What I can see so far from a smatch-based
> analysis, we have 409 __init style functions (.probe & builtin/module_
> _platform_driver_probe excluded) for 5.15 with allyesconfig. The number of
> distinct individual IO reads (MSRs included) is much higher than 2.4k and on the
> range of 30k because quite often there are more than a single IO read in the
> same source function. The full list of accesses and the possible call paths is huge
> for manually looking at, but here is the list of the 409 functions if anyone wants
> to take a look:


Thanks Elena.


I suspect the true number is even higher because that doesn't include IO 
inside calls to other modules and indirect pointers, correct?


-Andi
Reshetova, Elena Oct. 12, 2021, 6:57 p.m. UTC | #17
> I suspect the true number is even higher because that doesn't include IO
> inside calls to other modules and indirect pointers, correct?

Actually everything should be included. Smatch has cross-function db and
I am using it for getting the call chains and it follows function pointers.
Also since I am starting from a list of individual read IOs, every single
base read IO in drivers/* should be covered as far as I can see. But if it uses
some weird IO wrappers then the actual list might be higher.
Dan Williams Oct. 12, 2021, 7:13 p.m. UTC | #18
On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
>
> > I suspect the true number is even higher because that doesn't include IO
> > inside calls to other modules and indirect pointers, correct?
>
> Actually everything should be included. Smatch has cross-function db and
> I am using it for getting the call chains and it follows function pointers.
> Also since I am starting from a list of individual read IOs, every single
> base read IO in drivers/* should be covered as far as I can see. But if it uses
> some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?
Andi Kleen Oct. 12, 2021, 7:49 p.m. UTC | #19
On 10/12/2021 12:13 PM, Dan Williams wrote:
> On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
>>
>>> I suspect the true number is even higher because that doesn't include IO
>>> inside calls to other modules and indirect pointers, correct?
>> Actually everything should be included. Smatch has cross-function db and
>> I am using it for getting the call chains and it follows function pointers.
>> Also since I am starting from a list of individual read IOs, every single
>> base read IO in drivers/* should be covered as far as I can see. But if it uses
>> some weird IO wrappers then the actual list might be higher.
> Why analyze individual IO calls? I thought the goal here was to
> disable entire classes of ioremap() users?

This is everything that would need to be moved somewhere else if we 
didn't disable the entire classes of ioremap users.

-Andi
Michael S. Tsirkin Oct. 12, 2021, 9:11 p.m. UTC | #20
On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > others) in init functions that also register drivers (thanks Elena for
> > the number)
> 
> To provide more numbers on this. What I can see so far from a smatch-based
> analysis, we have 409 __init style functions (.probe & builtin/module_
> _platform_driver_probe excluded) for 5.15 with allyesconfig.

I don't think we care about allyesconfig at all though.
Just don't do that.
How about allmodconfig? This is closer to what distros actually do.
Dan Williams Oct. 12, 2021, 9:14 p.m. UTC | #21
On Tue, Oct 12, 2021 at 11:35 AM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> > The "better safe-than-sorry" argument is hard to build consensus
> > around. The spectre mitigations ran into similar problems where the
> > community rightly wanted to see the details and instrument the
> > problematic paths rather than blanket sprinkle lfence "just to be
> > safe".
>
> But that was due to performance problems in hot paths. Nothing of this
> applies here.

It applies because a new API that individual driver authors is being
proposed and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.

>
> > In this case the rules about when a driver is suitably
> > "hardened" are vague and the overlapping policy engines are confusing.
>
> What is confusing exactly?

Multiple places to go to enable functionality. The device-filter
firewall policy can collide with the ioremap access control policy.

> For me it both seems very straight forward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

> The policy is:
>
> - Have an allow list at driver registration.
>
> - Have an additional opt-in for MMIO mappings (and potentially config
> space, but that's not currently there) to cover init calls completely.

The proliferation of policy engines and driver special casing is the
issue. Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine. I.e. why special case the ioremap() in
virtio to be additionally authorized when the device has already been
authorized to probe? Put another way, the easiest driver API change to
merge would be no additional changes in leaf drivers.

>
> >
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
>
> A hardened driver is a driver that
>
> - Had similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non privileged user
> interface (like a ioctl). That review should be focused on memory safety.
>
> - Had some fuzzing on these IO interfaces using to be released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?

> Right now it's only three virtio drivers (console, net, block)
>
> Really it's no different than what we do for every new unprivileged user
> interface.
>
>
> > I.e. instead of jumping to the assertion that fixing up
> > these init-path vulnerabilities are too big to fix, dig to the next
> > level to provide more evidence that per-driver opt-in is the only
> > viable option.
> >
> > For example, how many of these problematic paths are built-in to the
> > average kernel config?
>
> I don't think arguments from "the average kernel config" (if such a
> thing even exists) are useful. That's would be just hand waving.

I'm trying to bridge to your contention that this enabling can not
rely on custom kernel configs and must offer protection on the same
kernel image that might ship in the host, but lets set this aside and
focus on when and where leaf drivers need to adopt a new API.

> > A strawman might be to add a sprinkling error
> > exits in the module_init() of the problematic drivers, and only fail
> > if the module is built-in, and let modprobe policy handle the rest.
>
>
> That would be already hundreds of changes. I have no idea how such a
> thing could be maintained or sustained either.
>
> Really I don't even see how these alternatives can be considered. Tree
> sweeps should always be last resort. They're a pain for everyone. But
> here they're casually thrown around as alternatives to straight forward
> one or two line changes.

If it looked straightforward I'm not sure we would be having this
discussion, I think it's reasonable to ask if this is a per-driver
opt-in responsibility that must be added in addition to probe
authorization.

> >> Default policy in user space just seems to be a bad idea here. Who
> >> should know if a driver is hardened other than the kernel? Maintaining
> >> the list somewhere else just doesn't make sense to me.
> > I do not understand the maintenance burden correlation of where the
> > policy is driven vs where the list is maintained?
>
> All the hardening and auditing happens in the kernel tree. So it seems
> the natural place to store the result is in the kernel tree.
>
> But there's no single package for initrd, so you would need custom
> configurations for all the supported distros.
>
> Also we're really arguing about a list that currently has three entries.
>
>
> >   Even if I agreed
> > with the contention that out-of-tree userspace would have a hard time
> > tracking the "hardened" driver list there is still an in-tree
> > userspace path to explore. E.g. perf maintains lists of things tightly
> > coupled to the kernel, this authorized device list seems to be in the
> > same category of data.
>
> You mean the event list? perf is in the kernel tree, so it's maintained
> together with the kernel.
>
> But we don't have a kernel initrd.

I'm proposing that this list is either tiny and slow moving enough for
initrd builders to track manually, or it's a data file that ships in
distro kernel packages that initrd builders can pull in.

> >> Also there is the more practical problem that some devices are needed
> >> for booting. For example in TDX we can't print something to the console
> >> with this mechanism, so you would never get any output before the
> >> initrd. Just seems like a nightmare for debugging anything. There really
> >> needs to be an authorization mechanism that works reasonably early.
> >>
> >> I can see a point of having user space overrides though, but we need to
> >> have a sane kernel default that works early.
> > Right, as I suggested [1], just enough early authorization to
> > bootstrap/debug initramfs and then that can authorize the remainder.
>
> But how do you debug the kernel then? Making early undebuggable seems
> just bad policy to me.

I am not proposing making the early undebuggable.

>
> And if you fix if for the console why not add the two more entries for
> virtio net and block too?

Again because there seems to be struggling consensus around what
criteria constitutes being added to this list. In order to move this
series forward I'm trying to find common ground.
Michael S. Tsirkin Oct. 12, 2021, 9:18 p.m. UTC | #22
On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:
> Especially in this case where the virtio use case being
> opted-in is *already* in a path that has been authorized by the
> device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?
Andi Kleen Oct. 12, 2021, 9:24 p.m. UTC | #23
On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:
>> Especially in this case where the virtio use case being
>> opted-in is *already* in a path that has been authorized by the
>> device-filter policy engine.
> That's a good point. Andi, how about setting a per-device flag
> if its ID has been allowed and then making pci_iomap create
> a shared mapping transparently?

Yes for pci_iomap we could do that.

If someone uses raw ioremap without a device it won't work, but I don't 
think that's the case for virtio at least.

I suppose we could solve that problem if it actually happens.

-Andi
Andi Kleen Oct. 12, 2021, 9:28 p.m. UTC | #24
>> But that was due to performance problems in hot paths. Nothing of this
>> applies here.
> It applies because a new API that individual driver authors is being
> proposed and that's an ongoing maintenance burden that might be
> mitigated by hiding that implementation detail from leaf drivers.

Right now we're only talking about 2 places to change, and none of those 
are actually in individual drivers, but in the virtio generic code and 
in the MSI code.

While there might be drivers in the future that do it directly it will 
be always the exception, normal drivers don't have to deal with this.



>> For me it both seems very straight forward and simple (but then I'm biased)
> You seem to be having a difficult time iterating this proposal toward
> consensus. I don't think the base principles are being contested as
> much as the semantics, scope, and need for the proposed API that is in
> the purview of all leaf driver developers.
Right now there is no leaf driver changed at all.
>
>>> I'd rather see more concerted efforts focused/limited core changes
>>> rather than leaf driver changes until there is a clearer definition of
>>> hardened.
>> A hardened driver is a driver that
>>
>> - Had similar security (not API) oriented review of its IO operations
>> (mainly MMIO access, but also PCI config space) as a non privileged user
>> interface (like a ioctl). That review should be focused on memory safety.
>>
>> - Had some fuzzing on these IO interfaces using to be released tools.
> What is the intersection of ioremap() users that are outside of the
> proposed probe authorization regime AND want confidential computing
> support?


Right now it's zero I believe.

That is there is other low level code that sets memory shared, but it's 
not using ioremap, but some other mechanisms.

>
> are needed
>>>> for booting. For example in TDX we can't print something to the console
>>>> with this mechanism, so you would never get any output before the
>>>> initrd. Just seems like a nightmare for debugging anything. There really
>>>> needs to be an authorization mechanism that works reasonably early.
>>>>
>>>> I can see a point of having user space overrides though, but we need to
>>>> have a sane kernel default that works early.
>>> Right, as I suggested [1], just enough early authorization to
>>> bootstrap/debug initramfs and then that can authorize the remainder.
>> But how do you debug the kernel then? Making early undebuggable seems
>> just bad policy to me.
> I am not proposing making the early undebuggable.


That's the implication of moving the policy into initrd.


If only initrd can authorize then it won't be possible to authorize 
before initrd, thus the early console won't work.

-Andi
Dan Williams Oct. 12, 2021, 10 p.m. UTC | #25
On Tue, Oct 12, 2021 at 2:28 PM Andi Kleen <ak@linux.intel.com> wrote:
[..]
> >> But how do you debug the kernel then? Making early undebuggable seems
> >> just bad policy to me.
> > I am not proposing making the early undebuggable.
>
>
> That's the implication of moving the policy into initrd.
>
>
> If only initrd can authorize then it won't be possible to authorize
> before initrd, thus the early console won't work.

Again, the proposal is that the allow-list is limited to just enough
devices to startup and debug the initramfs and no more. Everything
else can be dynamic, and this allows for a powerful custom override
interface without needing to debate additional ABI like command line
overrides, and minimizes future changes to this kernel-internal
allow-list.
Reshetova, Elena Oct. 14, 2021, 6:32 a.m. UTC | #26
> On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > others) in init functions that also register drivers (thanks Elena for
> > > the number)
> >
> > To provide more numbers on this. What I can see so far from a smatch-based
> > analysis, we have 409 __init style functions (.probe & builtin/module_
> > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> 
> I don't think we care about allyesconfig at all though.
> Just don't do that.
> How about allmodconfig? This is closer to what distros actually do.

It does not make any difference really for the content of the /drivers/*:
gives 408 __init style functions doing IO (.probe & builtin/module_
> > _platform_driver_probe excluded) for 5.15 with allmodconfig:

['doc200x_ident_chip',
'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
'late_iommu_features_init', 'detect_ivrs',
'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
'intel_cleanup_irq_remapping', 'detect_intel_iommu',
'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init',
'fb_console_init', 'xenbus_probe_backend_init',
'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init',
'check_tylersburg_isoch', 'dmar_table_init',
'enable_drhd_fault_handling', 'init_acpi_pm_clocksource',
'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init',
'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init',
'samsung_clk_register_mux', 'samsung_clk_register_gate',
'samsung_clk_register_fixed_rate', 'clk_boston_setup',
'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
'meson6_timer_init', 'atcpit100_timer_init',
'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init',
'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init',
'bxt_idle_state_table_update', 'sklh_idle_state_table_update',
'skx_idle_state_table_update',
'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe',
'com90io_probe', 'com90xx_probe', 'pcnet32_init_module',
'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find',
'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find',
'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect',
'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register',
'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register',
'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init',
'init_control', 'con_init', 'cr_pll_init',
'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init',
'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init',
'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init',
'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe',
'timer_probe', 'of_clk_init', '__reserved_mem_init_node',
'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip',
'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init',
'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma',
'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found',
'intel_soc_thermal_init', 'thermal_register_governors',
'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init',
'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup',
'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init',
'early_console_setup', 'init_port', 'early_serial8250_setup',
'linflex_console_setup', 'register_earlycon', 'of_setup_earlycon',
'slgt_init', 'moxa_init', 'parport_pc_init_superio',
'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init',
'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init',
'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list',
'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry',
'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init',
'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
'sh73a0_cpg_clocks_init', 'cpg_div6_register',
'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
'cpg_sd_clk_register', 'r7s9210_update_clk_table',
'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp',
'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup',
'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp',
'nvidia_set_debug_port', 'detect_set_debug_port',
'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init',
'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe',
'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init',
'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit',
'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init',
'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init',
'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect',
'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init',
'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init',
'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init',
'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller',
'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus',
'devlink_class_init', 'switchtec_ntb_init', 'mport_init',
'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb',
'probe_chip_type', 'legacy_check_special_cases',
'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init',
'pcistub_init_devices_late', 'bcma_host_soc_register',
'bcma_bus_early_register', 'vga_arb_device_init',
'vga_arb_select_default_device', 'zf_init',
'watchdog_deferred_registration', 'wb_smsc_wdt_init',
'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init',
'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe',
'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init',
'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data',
'dmi_scan_machine', 'pci_assign_unassigned_resources',
'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc',
'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc',
'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate',
'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks',
'intel_uncore_init', 'qedr_init_module', 'isapnp_peek',
'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list',
'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract',
'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe',
'intel_opregion_present', 'extlog_init', 'intel_pstate_init',
'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc',
'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct',
'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init',
'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init',
'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi',
'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag',
'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init',
'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init',
'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect',
'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init',
'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe',
'intel_pstate_msrs_not_valid',
'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init',
'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init',
'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init',
'rio_basic_attach']
Michael S. Tsirkin Oct. 14, 2021, 6:57 a.m. UTC | #27
On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > others) in init functions that also register drivers (thanks Elena for
> > > > the number)
> > >
> > > To provide more numbers on this. What I can see so far from a smatch-based
> > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > 
> > I don't think we care about allyesconfig at all though.
> > Just don't do that.
> > How about allmodconfig? This is closer to what distros actually do.
> 
> It does not make any difference really for the content of the /drivers/*:
> gives 408 __init style functions doing IO (.probe & builtin/module_
> > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> 
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

Um. ARM? Which architecture is this build for?


> 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
> 'late_iommu_features_init', 'detect_ivrs',
> 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
> 'intel_cleanup_irq_remapping', 'detect_intel_iommu',
> 'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init',
> 'fb_console_init', 'xenbus_probe_backend_init',
> 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
> 'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init',
> 'check_tylersburg_isoch', 'dmar_table_init',
> 'enable_drhd_fault_handling', 'init_acpi_pm_clocksource',
> 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> 'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init',
> 'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init',
> 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> 'meson6_timer_init', 'atcpit100_timer_init',
> 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init',
> 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> 'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init',
> 'bxt_idle_state_table_update', 'sklh_idle_state_table_update',
> 'skx_idle_state_table_update',
> 'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe',
> 'com90io_probe', 'com90xx_probe', 'pcnet32_init_module',
> 'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find',
> 'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find',
> 'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect',
> 'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register',
> 'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register',
> 'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init',
> 'init_control', 'con_init', 'cr_pll_init',
> 'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init',
> 'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init',
> 'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init',
> 'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe',
> 'timer_probe', 'of_clk_init', '__reserved_mem_init_node',
> 'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip',
> 'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init',
> 'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma',
> 'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found',
> 'intel_soc_thermal_init', 'thermal_register_governors',
> 'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init',
> 'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup',
> 'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init',
> 'early_console_setup', 'init_port', 'early_serial8250_setup',
> 'linflex_console_setup', 'register_earlycon', 'of_setup_earlycon',
> 'slgt_init', 'moxa_init', 'parport_pc_init_superio',
> 'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init',
> 'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init',
> 'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list',
> 'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry',
> 'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init',
> 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> 'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp',
> 'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup',
> 'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp',
> 'nvidia_set_debug_port', 'detect_set_debug_port',
> 'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init',
> 'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe',
> 'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init',
> 'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit',
> 'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init',
> 'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init',
> 'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect',
> 'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init',
> 'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init',
> 'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init',
> 'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller',
> 'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus',
> 'devlink_class_init', 'switchtec_ntb_init', 'mport_init',
> 'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb',
> 'probe_chip_type', 'legacy_check_special_cases',
> 'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init',
> 'pcistub_init_devices_late', 'bcma_host_soc_register',
> 'bcma_bus_early_register', 'vga_arb_device_init',
> 'vga_arb_select_default_device', 'zf_init',
> 'watchdog_deferred_registration', 'wb_smsc_wdt_init',
> 'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init',
> 'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe',
> 'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init',
> 'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data',
> 'dmi_scan_machine', 'pci_assign_unassigned_resources',
> 'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc',
> 'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc',
> 'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate',
> 'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks',
> 'intel_uncore_init', 'qedr_init_module', 'isapnp_peek',
> 'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list',
> 'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract',
> 'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe',
> 'intel_opregion_present', 'extlog_init', 'intel_pstate_init',
> 'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc',
> 'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct',
> 'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init',
> 'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init',
> 'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi',
> 'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag',
> 'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init',
> 'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init',
> 'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect',
> 'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init',
> 'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe',
> 'intel_pstate_msrs_not_valid',
> 'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init',
> 'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init',
> 'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init',
> 'rio_basic_attach']
Reshetova, Elena Oct. 14, 2021, 7:27 a.m. UTC | #28
> On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > the number)
> > > >
> > > > To provide more numbers on this. What I can see so far from a smatch-based
> > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > >
> > > I don't think we care about allyesconfig at all though.
> > > Just don't do that.
> > > How about allmodconfig? This is closer to what distros actually do.
> >
> > It does not make any difference really for the content of the /drivers/*:
> > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> >
> > ['doc200x_ident_chip',
> > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> 
> Um. ARM? Which architecture is this build for?

The list of smatch IO findings is built for x86, but the smatch cross function
database covers all archs, so when queried for all potential function callers,
it would show non x86 arch call chains also. 

Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':

  Detected low-level IO from arm_v7s_do_selftests in fun
__iommu_queue_command_sync

drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
{15002074744551330002}
    'check_host_input' read from the host using function 'readl' to a
member of the structure 'iommu->cmd_buf_head';

__iommu_queue_command_sync()
  iommu_completion_wait()
    amd_iommu_domain_flush_complete()
      iommu_v1_map_page()
        arm_v7s_do_selftests()

So, the results can be further filtered if you want a specified arch.
Michael S. Tsirkin Oct. 14, 2021, 9:26 a.m. UTC | #29
On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > the number)
> > > > >
> > > > > To provide more numbers on this. What I can see so far from a smatch-based
> > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > >
> > > > I don't think we care about allyesconfig at all though.
> > > > Just don't do that.
> > > > How about allmodconfig? This is closer to what distros actually do.
> > >
> > > It does not make any difference really for the content of the /drivers/*:
> > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > >
> > > ['doc200x_ident_chip',
> > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > 
> > Um. ARM? Which architecture is this build for?
> 
> The list of smatch IO findings is built for x86, but the smatch cross function
> database covers all archs, so when queried for all potential function callers,
> it would show non x86 arch call chains also. 
> 
> Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
> 
>   Detected low-level IO from arm_v7s_do_selftests in fun
> __iommu_queue_command_sync
> 
> drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> {15002074744551330002}
>     'check_host_input' read from the host using function 'readl' to a
> member of the structure 'iommu->cmd_buf_head';
> 
> __iommu_queue_command_sync()
>   iommu_completion_wait()
>     amd_iommu_domain_flush_complete()
>       iommu_v1_map_page()
>         arm_v7s_do_selftests()
> 
> So, the results can be further filtered if you want a specified arch. 

So what is it just for x86? Could you tell?
Michael S. Tsirkin Oct. 14, 2021, 11:49 a.m. UTC | #30
On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > the number)
> > > > >
> > > > > To provide more numbers on this. What I can see so far from a smatch-based
> > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > >
> > > > I don't think we care about allyesconfig at all though.
> > > > Just don't do that.
> > > > How about allmodconfig? This is closer to what distros actually do.
> > >
> > > It does not make any difference really for the content of the /drivers/*:
> > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > >
> > > ['doc200x_ident_chip',
> > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > 
> > Um. ARM? Which architecture is this build for?
> 
> The list of smatch IO findings is built for x86, but the smatch cross function
> database covers all archs, so when queried for all potential function callers,
> it would show non x86 arch call chains also. 
> 
> Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
> 
>   Detected low-level IO from arm_v7s_do_selftests in fun
> __iommu_queue_command_sync
> 
> drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> {15002074744551330002}
>     'check_host_input' read from the host using function 'readl' to a
> member of the structure 'iommu->cmd_buf_head';
> 
> __iommu_queue_command_sync()
>   iommu_completion_wait()
>     amd_iommu_domain_flush_complete()
>       iommu_v1_map_page()
>         arm_v7s_do_selftests()
> 
> So, the results can be further filtered if you want a specified arch. 

Even better would be a typical distro build.
Reshetova, Elena Oct. 14, 2021, 12:33 p.m. UTC | #31
> On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > > the number)
> > > > > >
> > > > > > To provide more numbers on this. What I can see so far from a smatch-
> based
> > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > > >
> > > > > I don't think we care about allyesconfig at all though.
> > > > > Just don't do that.
> > > > > How about allmodconfig? This is closer to what distros actually do.
> > > >
> > > > It does not make any difference really for the content of the /drivers/*:
> > > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > > >
> > > > ['doc200x_ident_chip',
> > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > >
> > > Um. ARM? Which architecture is this build for?
> >
> > The list of smatch IO findings is built for x86, but the smatch cross function
> > database covers all archs, so when queried for all potential function callers,
> > it would show non x86 arch call chains also.
> >
> > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
> >
> >   Detected low-level IO from arm_v7s_do_selftests in fun
> > __iommu_queue_command_sync
> >
> > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> > {15002074744551330002}
> >     'check_host_input' read from the host using function 'readl' to a
> > member of the structure 'iommu->cmd_buf_head';
> >
> > __iommu_queue_command_sync()
> >   iommu_completion_wait()
> >     amd_iommu_domain_flush_complete()
> >       iommu_v1_map_page()
> >         arm_v7s_do_selftests()
> >
> > So, the results can be further filtered if you want a specified arch.
> 
> So what is it just for x86? Could you tell?

I can probably figure out how to do additional filtering here, but does
it really matter for the case that is being discussed here? Andi's point was
that there quite many existing places in the kernel when low-level IO
happens before the probe stage. So I brought these numbers here.
What do you plan to do with the pure x86 results? 

And here are the full results for allyesconfig, if anyone is interested (just got permission to create
the repository today):
https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1
We will be pushing to this repo all the scripts and fuzzing setups we use as part of
our Linux guest hardening effort for confidential cloud computing, but it is going to take
some time to get all the approvals collected.  

Best Regards,
Elena.
Thomas Gleixner Oct. 17, 2021, 9:52 p.m. UTC | #32
Elena,

On Thu, Oct 14 2021 at 06:32, Elena Reshetova wrote:
>> On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> It does not make any difference really for the content of the /drivers/*:
> gives 408 __init style functions doing IO (.probe & builtin/module_
>> > _platform_driver_probe excluded) for 5.15 with allmodconfig:
>
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'ubi_gluebi_init', 'ubiblock_init'
> 'ubi_init', 'mtd_stresstest_init',

All of this is MTD and can just be disabled wholesale.

Aside of that, most of these depend on either platform devices or device
tree enumerations which are not ever available on X86.

> 'probe_acpi_namespace_devices',

> 'amd_iommu_init_pci', 'state_next',
> 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
> 'late_iommu_features_init', 'detect_ivrs',
> 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
> 'intel_cleanup_irq_remapping', 'detect_intel_iommu',
> 'parse_ioapics_under_ir', 'si_domain_init',
> 'intel_iommu_init', 'dmar_table_init',
> 'enable_drhd_fault_handling',
> 'check_tylersburg_isoch', 

None of this is reachable because the initial detection which is ACPI
table based will fail for TDX. If not, it's a guest firmware problem.

> 'fb_console_init', 'xenbus_probe_backend_init',
> 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
> 'balloon_init',

XEN, that's relevant because magically the TDX guest will assume that it
is a XEN instance?

> 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> 'sun5i_setup_clockevent',
> 'mt7621_clk_init',
> 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> 'meson6_timer_init', 'atcpit100_timer_init',
> 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init',
> 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> 'of_fixed_mmio_clk_setup',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

ARM based drivers are initialized on x86 in which way?

> 'hv_init_tsc_clocksource', 'hv_init_clocksource',

HyperV. See XEN

> 'skx_init',
> 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> 'pnd2_init', 'edac_init', 'adummy_init',

EDAC has already hypervisor checks

> 'init_acpi_pm_clocksource',

Requires ACPI table entry or command line override

> 'intel_rng_mod_init',

Has an old style PCI table which is searched via pci_get_device(). Could
do with a cleanup which converts it to proper PCI probing.

<SNIP>

So I stop here, because it would be way simpler to have the file names
but so far I could identify all of it from the top of my head.

So what are you trying to tell me? That you found tons of ioremaps in
__init functions which are completely irrelevant.

Please stop making arguments based on completely nonsensical data. It
took me less than 5 minutes to eliminate more than 50% of that list and
I'm pretty sure that I could have eliminated the bulk of the rest as
well.

The fact that a large part of this is ARM only, the fact that nobody
bothered to look at how e.g. IOMMU detection works and whether those
ioremaps actually can't be reached is hillarious.

So of these 400 instances are at least 30% ARM specific and those
cannot be reached on ARM nilly willy either because they are either
device tree or ACPI enumerated.

Claiming that it is soo much work to analyze 400 at least to the point:

  - whether they are relevant for x86 and therefore potentially TDX at
    all

  - whether they have some form of enumeration or detection which makes
    the ioremaps unreachable when the trusted BIOS is implemented
    correctly

Ijust can laugh at that, really:

  Two of my engineers have done an inventory of hundreds of cpu hotplug
  notifier instances in a couple of days some years ago. Ditto for a
  couple of hundred seqcount and a couple of hundred tasklet usage
  sites.

Sure, but it makes more security handwaving and a nice presentation to
tell people how much unsecure code there is based on half thought out
static analysis. To do a proper static analysis of this, you really
have to do a proper brain based analysis first of:

  1) Which code is relevant for x86

  2) What are the mechanisms which are used across the X86 relevant
     driver space to make these ioremap/MSR accesses actually reachable.

And of course this will not be complete, but this eliminates the vast
majority of your list. And looking at the remaining ones is not rocket
science either.

I can't take that serious at all. Come back when you have a properly
compiled list of drivers which:

  1) Can even be built for X86

  2) Do ioremap/MSR based poking unconditionally.

  3) Cannot be easily guarded off at the subsystem level

It's not going to be a huge list.

Then we can talk about facts and talk about the work required to fix
them or blacklist them in some way.

Thanks,

        tglx
Michael S. Tsirkin Oct. 17, 2021, 10:17 p.m. UTC | #33
On Thu, Oct 14, 2021 at 12:33:49PM +0000, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > > > the number)
> > > > > > >
> > > > > > > To provide more numbers on this. What I can see so far from a smatch-
> > based
> > > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > > > >
> > > > > > I don't think we care about allyesconfig at all though.
> > > > > > Just don't do that.
> > > > > > How about allmodconfig? This is closer to what distros actually do.
> > > > >
> > > > > It does not make any difference really for the content of the /drivers/*:
> > > > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > > > >
> > > > > ['doc200x_ident_chip',
> > > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > > >
> > > > Um. ARM? Which architecture is this build for?
> > >
> > > The list of smatch IO findings is built for x86, but the smatch cross function
> > > database covers all archs, so when queried for all potential function callers,
> > > it would show non x86 arch call chains also.
> > >
> > > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
> > >
> > >   Detected low-level IO from arm_v7s_do_selftests in fun
> > > __iommu_queue_command_sync
> > >
> > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> > > {15002074744551330002}
> > >     'check_host_input' read from the host using function 'readl' to a
> > > member of the structure 'iommu->cmd_buf_head';
> > >
> > > __iommu_queue_command_sync()
> > >   iommu_completion_wait()
> > >     amd_iommu_domain_flush_complete()
> > >       iommu_v1_map_page()
> > >         arm_v7s_do_selftests()
> > >
> > > So, the results can be further filtered if you want a specified arch.
> > 
> > So what is it just for x86? Could you tell?
> 
> I can probably figure out how to do additional filtering here, but does
> it really matter for the case that is being discussed here? Andi's point was
> that there quite many existing places in the kernel when low-level IO
> happens before the probe stage. So I brought these numbers here.
> What do you plan to do with the pure x86 results? 

If the list is short - just suggest securing that ;)


> And here are the full results for allyesconfig, if anyone is interested (just got permission to create
> the repository today):
> https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1
> We will be pushing to this repo all the scripts and fuzzing setups we use as part of
> our Linux guest hardening effort for confidential cloud computing, but it is going to take
> some time to get all the approvals collected.  
> 
> Best Regards,
> Elena.
Thomas Gleixner Oct. 18, 2021, 12:55 a.m. UTC | #34
Andi,

On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
> On 10/9/2021 1:39 PM, Dan Williams wrote:
>> I agree with you and Greg here. If a driver is accessing hardware
>> resources outside of the bind lifetime of one of the devices it
>> supports, and in a way that neither modrobe-policy nor
>> device-authorization -policy infrastructure can block, that sounds
>> like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
> others) in init functions that also register drivers (thanks Elena for 
> the number)

These numbers are completely useless simply because they are based on
nonsensical criteria. See:

  https://lore.kernel.org/r/87r1cj2uad.ffs@tglx

> My point is just that the ecosystem of devices that Linux supports is 
> messy enough that there are legitimate exceptions from the "First IO 
> only in probe call only" rule.

Your point is based on your outright refusal to actualy do a proper
analysis and your outright refusal to help fixing the real problems.

All you have provided so far is handwaving based on a completely useless
analysis.

Sure, your goal is to get this TDX problem solved, but it's not going to
be solved by:

  1) Providing a nonsensical analysis

  2) Using #1 as an argument to hack some half baken interfaces into the
     kernel which allow you to tick off your checkbox and then leave the
     resulting mess for others to clean up.
 
Try again when you have factual data to back up your claims and factual
arguments which prove that the problem can't be fixed otherwise.

I might be repeating myself, but kernel development works this way:

  1) Hack your private POC - Yay!

  2) Sit down and think hard about the problems you identified in step
     #1. Do a thorough analysis.
  
  3) Come up with a sensible integration plan.

  4) Do the necessary grump work of cleanups all over the place

  5) Add sensible infrastructure which is understandable for the bulk
     of kernel/driver developers

  6) Let your feature fall in place

and not in the way you are insisting on:

  1) Hack your private POC - Yay!

  2) Define that this is the only way to do it and try to shove it down
     the throat of everyone.

  3) Getting told that this is not the way it works

  4) Insist on it forever and blame the grumpy maintainers who are just
     not understanding the great value of your approach.

  5) Go back to #2

You should know that already, but I have no problem to give that lecture
to you over and over again. I probably should create a form letter.

And no, you can bitch about me as much as you want. These are not my
personal rules and personal pet pieves. These are rules Linus cares
about very much and aside of that they just reflect common sense.

  The kernel is a common good and not the dump ground for your personal
  brain waste.

  The kernel does not serve Intel. Quite the contrary Intel depends on
  the kernel to work nicely with it's hardware. Ergo, Intel should have
  a vested interest to serve the kernel and take responsibility for it
  as a whole. And so should you as an Intel employee.

Just dumping your next half baken workaround does not cut it especially
not when it is not backed up by sensible arguments.

Please try again, but not before you have something substantial to back
up your claims.

Thanks,

	Thomas
Thomas Gleixner Oct. 18, 2021, 1:10 a.m. UTC | #35
On Mon, Oct 18 2021 at 02:55, Thomas Gleixner wrote:
> On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
>> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and 
>> others) in init functions that also register drivers (thanks Elena for 
>> the number)
>
> These numbers are completely useless simply because they are based on
> nonsensical criteria. See:
>
>   https://lore.kernel.org/r/87r1cj2uad.ffs@tglx
>
>> My point is just that the ecosystem of devices that Linux supports is 
>> messy enough that there are legitimate exceptions from the "First IO 
>> only in probe call only" rule.
>
> Your point is based on your outright refusal to actualy do a proper
> analysis and your outright refusal to help fixing the real problems.
>
> All you have provided so far is handwaving based on a completely useless
> analysis.
>
> Sure, your goal is to get this TDX problem solved, but it's not going to
> be solved by:
>
>   1) Providing a nonsensical analysis
>
>   2) Using #1 as an argument to hack some half baken interfaces into the
>      kernel which allow you to tick off your checkbox and then leave the
>      resulting mess for others to clean up.
>  
> Try again when you have factual data to back up your claims and factual
> arguments which prove that the problem can't be fixed otherwise.
>
> I might be repeating myself, but kernel development works this way:
>
>   1) Hack your private POC - Yay!
>
>   2) Sit down and think hard about the problems you identified in step
>      #1. Do a thorough analysis.
>   
>   3) Come up with a sensible integration plan.
>
>   4) Do the necessary grump work of cleanups all over the place
>
>   5) Add sensible infrastructure which is understandable for the bulk
>      of kernel/driver developers
>
>   6) Let your feature fall in place
>
> and not in the way you are insisting on:
>
>   1) Hack your private POC - Yay!
>
>   2) Define that this is the only way to do it and try to shove it down
>      the throat of everyone.
>
>   3) Getting told that this is not the way it works
>
>   4) Insist on it forever and blame the grumpy maintainers who are just
>      not understanding the great value of your approach.
>
>   5) Go back to #2
>
> You should know that already, but I have no problem to give that lecture
> to you over and over again. I probably should create a form letter.
>
> And no, you can bitch about me as much as you want. These are not my
> personal rules and personal pet pieves. These are rules Linus cares
> about very much and aside of that they just reflect common sense.
>
>   The kernel is a common good and not the dump ground for your personal
>   brain waste.
>
>   The kernel does not serve Intel. Quite the contrary Intel depends on
>   the kernel to work nicely with it's hardware. Ergo, Intel should have
>   a vested interest to serve the kernel and take responsibility for it
>   as a whole. And so should you as an Intel employee.
>
> Just dumping your next half baken workaround does not cut it especially
> not when it is not backed up by sensible arguments.
>
> Please try again, but not before you have something substantial to back
> up your claims.

That said, I can't resist the urge to say a few words to the responsible
senior and management people at Intel in this context:

I surely know that a lot of Intel people claim that their lack of
progress is _only_ because Thomas is hard to work with and Thomas wants
unreasonable changes to their code, which I could perceive as an abuse of
myself for the purpose of self-deception. TBH, I don't give a damn.

Let me ask a few questions instead:

  - Is it unreasonable to expect that argumentations are based on facts
    and proper analysis?

  - Is it unreasonable to expect a proper integration of a new feature?

  - Does it take unreasonable effort to do a proper design?

  - Is it unreasonable to ask that he necessary cleanups are done
    upfront?

If anyone of the responsible people at Intel thinks so, then they should
speak up now and tell me in public and into my face what's so
unreasonable about that.

Thanks,

	Thomas
Reshetova, Elena Oct. 18, 2021, 7:03 a.m. UTC | #36
Thank Thomas for your review, please see my responses/comments inline.

> > 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> > 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> > 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> > 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> > 'sun5i_setup_clockevent',
> > 'mt7621_clk_init',
> > 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> > 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> > 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> > 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> > 'meson6_timer_init', 'atcpit100_timer_init',
> > 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init',
> > 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> > 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> > 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> > 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> > 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> > 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> > 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> > 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> > 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> > 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> > 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> > 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> > 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> > 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> > 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> > 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> > 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> > 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> > 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> > 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> > 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> > 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> > 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> > 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> > 'of_fixed_mmio_clk_setup',
> > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> 
> ARM based drivers are initialized on x86 in which way?

As I already explained to Michael the reason why ARM code is included
into this list is the fact that that smatch cross function database contains
all code paths, so when querying up for the possible ones you get everything.
I will filter this list to x86 and provide results since this seems to be important.
The reason why I don't see this as important is that the threat model we are
trying to protect against here (malicious VMM/host) is not TDX specific and 
I see no reason why in some near or further future ARM arch would also have
a confidential cloud computing solution and they would need to do exactly the
same thing. 

> 
> > 'hv_init_tsc_clocksource', 'hv_init_clocksource',
> 
> HyperV. See XEN
> 
> > 'skx_init',
> > 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> > 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> > 'pnd2_init', 'edac_init', 'adummy_init',
> 
> EDAC has already hypervisor checks
> 
> > 'init_acpi_pm_clocksource',
> 
> Requires ACPI table entry or command line override
> 
> > 'intel_rng_mod_init',
> 
> Has an old style PCI table which is searched via pci_get_device(). Could
> do with a cleanup which converts it to proper PCI probing.
> 
> <SNIP>
> 
> So I stop here, because it would be way simpler to have the file names
> but so far I could identify all of it from the top of my head.
> 
> So what are you trying to tell me? That you found tons of ioremaps in
> __init functions which are completely irrelevant.
> 
> Please stop making arguments based on completely nonsensical data. It
> took me less than 5 minutes to eliminate more than 50% of that list and
> I'm pretty sure that I could have eliminated the bulk of the rest as
> well.
> 
> The fact that a large part of this is ARM only, the fact that nobody
> bothered to look at how e.g. IOMMU detection works and whether those
> ioremaps actually can't be reached is hillarious.
> 
> So of these 400 instances are at least 30% ARM specific and those
> cannot be reached on ARM nilly willy either because they are either
> device tree or ACPI enumerated.
> 
> Claiming that it is soo much work to analyze 400 at least to the point:

Please bear in mind that the 400 function list is not complete by any means.
Many drivers define driver-specific wrappers for low-level IO and then use
these wrappers through their code. For the drivers we have audited (like virtIO)
we have manually read the code of each driver, identified these wrapper 
functions (like virtio_cread*) and then added them into the static analyzer
to track the all the invocations of these functions. How do you propose to
repeat this exercise for all the Linux kernel driver/module codebase?

> 
>   - whether they are relevant for x86 and therefore potentially TDX at
>     all
> 
>   - whether they have some form of enumeration or detection which makes
>     the ioremaps unreachable when the trusted BIOS is implemented
>     correctly
> 
> Ijust can laugh at that, really:
> 
>   Two of my engineers have done an inventory of hundreds of cpu hotplug
>   notifier instances in a couple of days some years ago. Ditto for a
>   couple of hundred seqcount and a couple of hundred tasklet usage
>   sites.
> 
> Sure, but it makes more security handwaving and a nice presentation to
> tell people how much unsecure code there is based on half thought out
> static analysis. To do a proper static analysis of this, you really
> have to do a proper brain based analysis first of:
> 
>   1) Which code is relevant for x86
> 
>   2) What are the mechanisms which are used across the X86 relevant
>      driver space to make these ioremap/MSR accesses actually reachable.
> 
> And of course this will not be complete, but this eliminates the vast
> majority of your list. And looking at the remaining ones is not rocket
> science either.
> 
> I can't take that serious at all. Come back when you have a properly
> compiled list of drivers which:
> 
>   1) Can even be built for X86
> 
>   2) Do ioremap/MSR based poking unconditionally.
> 
>   3) Cannot be easily guarded off at the subsystem level

I see two main problems with this approach (in addition to the above fact that
we need to spend *a lot of time* building the complete list of such functions first):

1. It is very error prone since it would involve a lot of manual code
audit done by humans. And here each mistake is a potential new CVE for the kernel
in the scope of confidential computing threat model. 

2. It would require a lot of expertise from people who would want to run
a secure protected guest kernel to make sure that their particular setup is secure.
Essentially they would need to completely repeat the hardening exercise from scratch
and verify all the involved code paths to make sure for their build, certain code is 
indeed disabled, guarded at the subsystem level, not reachable because of cupid bits, etc. etc.
Not everyone has resources to do such an analysis (I would say little do), so we will end up
with a lot of unsecure production kernels, because time to market pressure would win over the
security if doing it means so much work. 

Speaking in security terms what you propose is to start from day one analyzing the whole
existing and waste attack surface, fix all the security issues in it manually one by one and then
somewhere in 20 years from now be done with it (or maybe never because there is always
new code coming in, and some would introduce new problems).

What we are proposing is first to try to minimize the attack surface as much as possible with a simple
and well understood set of controls and then spend realistic amount of time securing this minimized
surface. Again, this is not TDX specific attack surface, but generic to any guest kernel that wants to be
secure under confidential cloud computing threat model. So, it is not us who is pushing smth into the
kernel for the sake of TDX, but we seems to be the first ones so far who cares about the whole picture
and not just to provide HW means to run a protected guest. And given that most of the drivers
have never been written with this confidential cloud computing  threat model in mind, it is going to
take time to fix all of them. This really should be a community effort and a long-running task.
Take a look on this paper for example: https://arxiv.org/pdf/2109.10660.pdf They have tried to
fuzz just small set of 22 drivers from this attack surface and found 50 security relevant bugs.
And fuzzing is of course no ultimate security testing technique. So, I don't see how without the
drastically reducing the scope of security hardening first we can end up with a secure setup.
And then as the time goes and more people looking into this attack surface, we can (hopefully)
gradually open it up. 

Best Regards,
Elena.
Greg KH Oct. 18, 2021, 12:08 p.m. UTC | #37
On Sun, Oct 10, 2021 at 03:11:23PM -0700, Andi Kleen wrote:
> 
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > > > From: Andi Kleen <ak@linux.intel.com>
> > > > 
> > > > For Confidential VM guests like TDX, the host is untrusted and hence
> > > > the devices emulated by the host or any data coming from the host
> > > > cannot be trusted. So the drivers that interact with the outside world
> > > > have to be hardened by sharing memory with host on need basis
> > > > with proper hardening fixes.
> > > > 
> > > > For the PCI driver case, to share the memory with the host add
> > > > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> > > > 
> > > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > > So I proposed to make all pci mappings shared, eliminating the need
> > > to patch drivers.
> > > 
> > > To which Andi replied
> > >          One problem with removing the ioremap opt-in is that
> > >          it's still possible for drivers to get at devices without going through probe.
> > > 
> > > To which Greg replied:
> > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
> > >          If there are in-kernel PCI drivers that do not do this, they need to be
> > >          fixed today.
> > > 
> > > Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modrobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
> 
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for the
> number)
> 
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the only
> way they can be implemented because they often have no other enumeration
> mechanism. For PCI drivers it's rarer, but also still can happen. One
> example that comes to mind here is the x86 Intel uncore drivers, which
> support a mix of MSR, ioremap and PCI config space accesses all from the
> same driver. This particular example can (and should be) fixed in other
> ways, but similar things also happen in other drivers, and they're not all
> broken. Even for the broken ones they're usually for some crufty old devices
> that has very few users, so it's likely untestable in practice.
> 
> My point is just that the ecosystem of devices that Linux supports is messy
> enough that there are legitimate exceptions from the "First IO only in probe
> call only" rule.

No, there should not be for PCI drivers.  If there is, that is a bug
that you can, and should, fix.

> And we can't just fix them all. Even if we could it would be hard to
> maintain.

Not true at all, you can fix them, and write a simple coccinelle rule to
prevent them from ever coming back in.

> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.

No it is not.  It is "easier" for you because you all do not want to fix
up all of the drivers and want to add additional code complexity on top
of the current mess that we have and then you can claim that you have
"hardened" the drivers you care about.

Despite no one ever explaining exactly what "hardened" means to me.

Again, fix the existing drivers, you have the whole source, if this is
something that you all care about, it should not be hard to do.

Stop making excuses.

greg k-h
Greg KH Oct. 18, 2021, 12:13 p.m. UTC | #38
On Tue, Oct 12, 2021 at 11:35:04AM -0700, Andi Kleen wrote:
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
> 
> A hardened driver is a driver that

Ah, you do define this, thank you!

> - Had similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non privileged user
> interface (like a ioctl). That review should be focused on memory safety.

Where is this review done?  Where is is documented?  Who is responsible
for keeping it up to date with every code change to the driver, and to
the code that the driver calls and the code that calls the driver?

> - Had some fuzzing on these IO interfaces using to be released tools.

"some"?  What tools?  What is the input, and where is that defined?  How
much fuzzing do you claim is "good enough"?

> Right now it's only three virtio drivers (console, net, block)

Where was this work done and published?  And why only 3?

> Really it's no different than what we do for every new unprivileged user
> interface.

Really?  I have seen loads of new drivers from Intel submitted in the
past months that would fail any of the above things just based on
obvious code reviews that I end up having to do...

If you want to start a "hardened driver" effort, there's a lot of real
work that needs to be done here and documented, and explained why it can
not just be done for the whole kernel...

greg k-h
Greg KH Oct. 18, 2021, 12:15 p.m. UTC | #39
On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote:
> On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> > 
> > > To which Andi replied
> > > 	One problem with removing the ioremap opt-in is that
> > > 	it's still possible for drivers to get at devices without going through probe.
> > > 
> > > To which Greg replied:
> > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
> > > 	If there are in-kernel PCI drivers that do not do this, they need to be
> > > 	fixed today.
> > > 
> > > Can you guys resolve the differences here?
> > 
> > 
> > I addressed this in my other mail, but we may need more discussion.
> 
> Hopefully Greg will reply to that one.

Note, when wanting Greg to reply, someone should at the very least cc:
him.

{sigh}

greg k-h
Michael S. Tsirkin Oct. 18, 2021, 1:17 p.m. UTC | #40
On Mon, Oct 18, 2021 at 02:15:47PM +0200, Greg KH wrote:
> On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote:
> > On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> > > 
> > > > To which Andi replied
> > > > 	One problem with removing the ioremap opt-in is that
> > > > 	it's still possible for drivers to get at devices without going through probe.
> > > > 
> > > > To which Greg replied:
> > > > https://lore.kernel.org/all/YVXBNJ431YIWwZdQ@kroah.com/
> > > > 	If there are in-kernel PCI drivers that do not do this, they need to be
> > > > 	fixed today.
> > > > 
> > > > Can you guys resolve the differences here?
> > > 
> > > 
> > > I addressed this in my other mail, but we may need more discussion.
> > 
> > Hopefully Greg will reply to that one.
> 
> Note, when wanting Greg to reply, someone should at the very least cc:
> him.

"that one" being "Andi's other mail". Which I don't remember what it was,
by now. Sorry.

> {sigh}
> 
> greg k-h
diff mbox series

Patch

diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
index df636c6d8e6c..a4a83c8ab3cf 100644
--- a/include/asm-generic/pci_iomap.h
+++ b/include/asm-generic/pci_iomap.h
@@ -18,6 +18,12 @@  extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
 extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
 					unsigned long offset,
 					unsigned long maxlen);
+extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
+					   unsigned long max);
+extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
+						 unsigned long offset,
+						 unsigned long maxlen);
+
 /* Create a virtual mapping cookie for a port on a given PCI device.
  * Do not call this directly, it exists to make it easier for architectures
  * to override */
diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
index 57bd92f599ee..2816dc8715da 100644
--- a/lib/pci_iomap.c
+++ b/lib/pci_iomap.c
@@ -25,6 +25,11 @@  static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
 	return ioremap_wc(addr, size);
 }
 
+static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size)
+{
+	return ioremap_host_shared(addr, size);
+}
+
 static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
 					 int bar,
 					 unsigned long offset,
@@ -106,6 +111,48 @@  void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
 }
 EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
 
+/**
+ * pci_iomap_host_shared_range - create a virtual shared mapping cookie
+ *				 for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @offset: map memory at the given offset in BAR
+ * @maxlen: max length of the memory to map
+ *
+ * Remap a pci device's resources shared in a confidential guest.
+ * For more details see pci_iomap_range's documentation.
+ *
+ * @maxlen specifies the maximum length to map. To get access to
+ * the complete BAR from offset to the end, pass %0 here.
+ */
+void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
+					  unsigned long offset,
+					  unsigned long maxlen)
+{
+	return pci_iomap_range_map(dev, bar, offset, maxlen,
+				   map_ioremap_host_shared, true);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range);
+
+/**
+ * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @maxlen: length of the memory to map
+ *
+ * See pci_iomap for details. This function creates a shared mapping
+ * with the host for confidential hosts.
+ *
+ * @maxlen specifies the maximum length to map. To get access to the
+ * complete BAR without checking for its length first, pass %0 here.
+ */
+void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
+			       unsigned long maxlen)
+{
+	return pci_iomap_host_shared_range(dev, bar, 0, maxlen);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_host_shared);
+
 /**
  * pci_iomap - create a virtual mapping cookie for a PCI BAR
  * @dev: PCI device that owns the BAR