
[RFC,v2,1/1] platform-msi: Add platform check for subdevice irq domain

Message ID 20210106022749.2769057-1-baolu.lu@linux.intel.com (mailing list archive)
State Not Applicable
Series [RFC,v2,1/1] platform-msi: Add platform check for subdevice irq domain

Commit Message

Baolu Lu Jan. 6, 2021, 2:27 a.m. UTC
The pci_subdevice_msi_create_irq_domain() should fail if the underlying
platform is not able to support IMS (Interrupt Message Storage).
Otherwise, interrupt isolation is not guaranteed.

For x86, IMS is only supported on bare metal for now. We could enable it
in virtualization environments in the future if an interrupt HYPERCALL
domain is supported or the hardware is capable of interrupt isolation
for subdevices.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/linux-pci/87pn4nk7nn.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqrnzr3.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqqmc2h.fsf@nanos.tec.linutronix.de/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 arch/x86/pci/common.c       | 47 +++++++++++++++++++++++++++++++++++++
 drivers/base/platform-msi.c |  8 +++++++
 include/linux/msi.h         |  1 +
 3 files changed, 56 insertions(+)


Background:
Learnt from the discussions in this thread:

https://lore.kernel.org/linux-pci/160408357912.912050.17005584526266191420.stgit@djiang5-desk3.ch.intel.com/

The device IMS (Interrupt Message Storage) should not be enabled in any
virtualization environment unless there is a HYPERCALL domain which lets
the changes to the message store be managed by the hypervisor.

As an initial step, we allow IMS to be enabled only when running on bare
metal. It will be easy to enable IMS in virtualization environments in
the future once the above preconditions are met.

We have thought about moving on_bare_metal() to a generic file so that
it could be well maintained and reused, but we need some suggestions
about where to put it. Your comments are much appreciated.

This patch is for comments only; please don't merge it. We will include
it in the Intel IMS implementation later once we reach a consensus.

Change log:
v1->v2:
 - v1:
   https://lore.kernel.org/linux-pci/20201210004624.345282-1-baolu.lu@linux.intel.com/
 - Rename probably_on_bare_metal() to on_bare_metal();
 - Some vendors might use the same name for both bare metal and virtual
   environments. Before we add vendor-specific code to distinguish
   between them, let's return false in on_bare_metal(). This won't
   introduce any regression. The only impact is that the coming new
   platform MSI feature won't be supported until the vendor-specific
   code is provided.

Best regards,
baolu

Comments

Leon Romanovsky Jan. 6, 2021, 6:06 a.m. UTC | #1
On Wed, Jan 06, 2021 at 10:27:49AM +0800, Lu Baolu wrote:
> The pci_subdevice_msi_create_irq_domain() should fail if the underlying
> platform is not able to support IMS (Interrupt Message Storage). Otherwise,
> the isolation of interrupt is not guaranteed.
>
> For x86, IMS is only supported on bare metal for now. We could enable it
> in the virtualization environments in the future if interrupt HYPERCALL
> domain is supported or the hardware has the capability of interrupt
> isolation for subdevices.
>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/linux-pci/87pn4nk7nn.fsf@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/linux-pci/877dqrnzr3.fsf@nanos.tec.linutronix.de/
> Link: https://lore.kernel.org/linux-pci/877dqqmc2h.fsf@nanos.tec.linutronix.de/
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  arch/x86/pci/common.c       | 47 +++++++++++++++++++++++++++++++++++++
>  drivers/base/platform-msi.c |  8 +++++++
>  include/linux/msi.h         |  1 +
>  3 files changed, 56 insertions(+)
>
>
> Background:
> Learnt from the discussions in this thread:
>
> https://lore.kernel.org/linux-pci/160408357912.912050.17005584526266191420.stgit@djiang5-desk3.ch.intel.com/
>
> The device IMS (Interrupt Message Storage) should not be enabled in any
> virtualization environments unless there is a HYPERCALL domain which
> makes the changes in the message store managed by the hypervisor.
>
> As the initial step, we allow the IMS to be enabled only if we are
> running on the bare metal. It's easy to enable IMS in the virtualization
> environments if above preconditions are met in the future.
>
> We ever thought about moving on_bare_metal() to a generic file so that
> it could be well maintained and used. But we need some suggestions about
> where to put it. Your comments are very appreciated.
>
> This patch is only for comments purpose. Please don't merge it. We will
> include it in the Intel IMS implementation later once we reach a
> consensus.
>
> Change log:
> v1->v2:
>  - v1:
>    https://lore.kernel.org/linux-pci/20201210004624.345282-1-baolu.lu@linux.intel.com/
>  - Rename probably_on_bare_metal() with on_bare_metal();
>  - Some vendors might use the same name for both bare metal and virtual
>    environment. Before we add vendor specific code to distinguish
>    between them, let's return false in on_bare_metal(). This won't
>    introduce any regression. The only impact is that the coming new
>    platform msi feature won't be supported until the vendor specific code
>    is provided.
>
> Best regards,
> baolu
>
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index 3507f456fcd0..963e0401f2b2 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -724,3 +724,50 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
>  	return dev;
>  }
>  #endif
> +
> +/*
> + * We want to figure out which context we are running in. But the hardware
> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> + * which can be manipulated by the VMM to let the OS figure out where it runs.
> + * So we go with the below probably on_bare_metal() function as a replacement
> + * for definitely on_bare_metal() to go forward only for the very simple reason
> + * that this is the only option we have.
> + *
> + * People might use the same vendor name for both bare metal and virtual
> + * environment. We can remove those names once we have vendor specific code to
> + * distinguish between them.
> + */
> +static const char * const vmm_vendor_name[] = {
> +	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> +	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
> +	"Microsoft Corporation", "Amazon EC2"
> +};

Maybe it is not a concern at all, but this approach makes
forward/backward compatibility without a kernel upgrade impossible.

Once QEMU (for example) gains the needed support, someone will need to
remove QEMU from this array and rewrite on_bare_metal(), because it is
no longer bare vs. virtual, and a kernel upgrade/downgrade will be
required every time the QEMU version is switched.

Plus the need to update stable@ and distros.

I can already feel the pain from the field while people debug such code.

Am I missing something completely?

Thanks
Baolu Lu Jan. 6, 2021, 10:10 a.m. UTC | #2
Hi Leon,

On 2021/1/6 14:06, Leon Romanovsky wrote:
> On Wed, Jan 06, 2021 at 10:27:49AM +0800, Lu Baolu wrote:
>> The pci_subdevice_msi_create_irq_domain() should fail if the underlying
>> platform is not able to support IMS (Interrupt Message Storage). Otherwise,
>> the isolation of interrupt is not guaranteed.
>>
>> For x86, IMS is only supported on bare metal for now. We could enable it
>> in the virtualization environments in the future if interrupt HYPERCALL
>> domain is supported or the hardware has the capability of interrupt
>> isolation for subdevices.
>>
>> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/linux-pci/87pn4nk7nn.fsf@nanos.tec.linutronix.de/
>> Link: https://lore.kernel.org/linux-pci/877dqrnzr3.fsf@nanos.tec.linutronix.de/
>> Link: https://lore.kernel.org/linux-pci/877dqqmc2h.fsf@nanos.tec.linutronix.de/
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> ---
>>   arch/x86/pci/common.c       | 47 +++++++++++++++++++++++++++++++++++++
>>   drivers/base/platform-msi.c |  8 +++++++
>>   include/linux/msi.h         |  1 +
>>   3 files changed, 56 insertions(+)
>>
>>
>> Background:
>> Learnt from the discussions in this thread:
>>
>> https://lore.kernel.org/linux-pci/160408357912.912050.17005584526266191420.stgit@djiang5-desk3.ch.intel.com/
>>
>> The device IMS (Interrupt Message Storage) should not be enabled in any
>> virtualization environments unless there is a HYPERCALL domain which
>> makes the changes in the message store managed by the hypervisor.
>>
>> As the initial step, we allow the IMS to be enabled only if we are
>> running on the bare metal. It's easy to enable IMS in the virtualization
>> environments if above preconditions are met in the future.
>>
>> We ever thought about moving on_bare_metal() to a generic file so that
>> it could be well maintained and used. But we need some suggestions about
>> where to put it. Your comments are very appreciated.
>>
>> This patch is only for comments purpose. Please don't merge it. We will
>> include it in the Intel IMS implementation later once we reach a
>> consensus.
>>
>> Change log:
>> v1->v2:
>>   - v1:
>>     https://lore.kernel.org/linux-pci/20201210004624.345282-1-baolu.lu@linux.intel.com/
>>   - Rename probably_on_bare_metal() with on_bare_metal();
>>   - Some vendors might use the same name for both bare metal and virtual
>>     environment. Before we add vendor specific code to distinguish
>>     between them, let's return false in on_bare_metal(). This won't
>>     introduce any regression. The only impact is that the coming new
>>     platform msi feature won't be supported until the vendor specific code
>>     is provided.
>>
>> Best regards,
>> baolu
>>
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index 3507f456fcd0..963e0401f2b2 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -724,3 +724,50 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
>>   	return dev;
>>   }
>>   #endif
>> +
>> +/*
>> + * We want to figure out which context we are running in. But the hardware
>> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
>> + * which can be manipulated by the VMM to let the OS figure out where it runs.
>> + * So we go with the below probably on_bare_metal() function as a replacement
>> + * for definitely on_bare_metal() to go forward only for the very simple reason
>> + * that this is the only option we have.
>> + *
>> + * People might use the same vendor name for both bare metal and virtual
>> + * environment. We can remove those names once we have vendor specific code to
>> + * distinguish between them.
>> + */
>> +static const char * const vmm_vendor_name[] = {
>> +	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
>> +	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
>> +	"Microsoft Corporation", "Amazon EC2"
>> +};
> 
> Maybe it is not concern at all, but this approach will make
> forward/backward compatibility without kernel upgrade impossible.
> 
> Once QEMU (example) will have needed support, someone will need to remove
> the QEMU from this array, rewrite on_bare_metal() because it is not bare
> vs. virtual anymore and require kernel upgrade/downgrade every time QEMU
> version is switched.
> 
> Plus need to update stable@ and distros.
> 
> I'm already feeling pain from the fields while they debug such code.
> 
> Am I missing it completely?

The basic need here is to disallow a brand-new feature (device IMS)
from being enabled in any VMM environment.

The CPUID bit (X86_FEATURE_HYPERVISOR) is a good choice, but it's
optional and not even documented. So besides it, we maintain a block
list (vmm_vendor_name) which lists all possible VMM vendor names. If
dmi_match(DMI_SYS_VENDOR) hits, the new feature is not allowed to be
enabled.

This block list is a bit overkill since some vendor names could also be
used on bare metal. We will delay enabling the new feature for those
cases until we have a vendor-specific way to distinguish between bare
metal and VMM environments.
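
For quick reference, here is a condensed sketch of that check order,
matching the on_bare_metal() implementation in the patch at the bottom
of this page (CPUID bit first, then the DMI block list):

static bool on_bare_metal(void)
{
	int i;

	/* Trust the hypervisor CPUID bit whenever it is set. */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	/* Otherwise fall back to the DMI_SYS_VENDOR block list. */
	for (i = 0; i < ARRAY_SIZE(vmm_vendor_name); i++)
		if (dmi_match(DMI_SYS_VENDOR, vmm_vendor_name[i]))
			return false;

	return true;
}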

Honestly speaking, I can't see any compatibility issue, as it's common
that a new feature is supported in a new kernel but not in an old one.

Best regards,
baolu
Leon Romanovsky Jan. 6, 2021, 10:40 a.m. UTC | #3
On Wed, Jan 06, 2021 at 06:10:09PM +0800, Lu Baolu wrote:
> Hi Leon,
>
> On 2021/1/6 14:06, Leon Romanovsky wrote:
> > On Wed, Jan 06, 2021 at 10:27:49AM +0800, Lu Baolu wrote:
> > > The pci_subdevice_msi_create_irq_domain() should fail if the underlying
> > > platform is not able to support IMS (Interrupt Message Storage). Otherwise,
> > > the isolation of interrupt is not guaranteed.
> > >
> > > For x86, IMS is only supported on bare metal for now. We could enable it
> > > in the virtualization environments in the future if interrupt HYPERCALL
> > > domain is supported or the hardware has the capability of interrupt
> > > isolation for subdevices.
> > >
> > > Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> > > Link: https://lore.kernel.org/linux-pci/87pn4nk7nn.fsf@nanos.tec.linutronix.de/
> > > Link: https://lore.kernel.org/linux-pci/877dqrnzr3.fsf@nanos.tec.linutronix.de/
> > > Link: https://lore.kernel.org/linux-pci/877dqqmc2h.fsf@nanos.tec.linutronix.de/
> > > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > ---
> > >   arch/x86/pci/common.c       | 47 +++++++++++++++++++++++++++++++++++++
> > >   drivers/base/platform-msi.c |  8 +++++++
> > >   include/linux/msi.h         |  1 +
> > >   3 files changed, 56 insertions(+)
> > >
> > >
> > > Background:
> > > Learnt from the discussions in this thread:
> > >
> > > https://lore.kernel.org/linux-pci/160408357912.912050.17005584526266191420.stgit@djiang5-desk3.ch.intel.com/
> > >
> > > The device IMS (Interrupt Message Storage) should not be enabled in any
> > > virtualization environments unless there is a HYPERCALL domain which
> > > makes the changes in the message store managed by the hypervisor.
> > >
> > > As the initial step, we allow the IMS to be enabled only if we are
> > > running on the bare metal. It's easy to enable IMS in the virtualization
> > > environments if above preconditions are met in the future.
> > >
> > > We ever thought about moving on_bare_metal() to a generic file so that
> > > it could be well maintained and used. But we need some suggestions about
> > > where to put it. Your comments are very appreciated.
> > >
> > > This patch is only for comments purpose. Please don't merge it. We will
> > > include it in the Intel IMS implementation later once we reach a
> > > consensus.
> > >
> > > Change log:
> > > v1->v2:
> > >   - v1:
> > >     https://lore.kernel.org/linux-pci/20201210004624.345282-1-baolu.lu@linux.intel.com/
> > >   - Rename probably_on_bare_metal() with on_bare_metal();
> > >   - Some vendors might use the same name for both bare metal and virtual
> > >     environment. Before we add vendor specific code to distinguish
> > >     between them, let's return false in on_bare_metal(). This won't
> > >     introduce any regression. The only impact is that the coming new
> > >     platform msi feature won't be supported until the vendor specific code
> > >     is provided.
> > >
> > > Best regards,
> > > baolu
> > >
> > > diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> > > index 3507f456fcd0..963e0401f2b2 100644
> > > --- a/arch/x86/pci/common.c
> > > +++ b/arch/x86/pci/common.c
> > > @@ -724,3 +724,50 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
> > >   	return dev;
> > >   }
> > >   #endif
> > > +
> > > +/*
> > > + * We want to figure out which context we are running in. But the hardware
> > > + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> > > + * which can be manipulated by the VMM to let the OS figure out where it runs.
> > > + * So we go with the below probably on_bare_metal() function as a replacement
> > > + * for definitely on_bare_metal() to go forward only for the very simple reason
> > > + * that this is the only option we have.
> > > + *
> > > + * People might use the same vendor name for both bare metal and virtual
> > > + * environment. We can remove those names once we have vendor specific code to
> > > + * distinguish between them.
> > > + */
> > > +static const char * const vmm_vendor_name[] = {
> > > +	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> > > +	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
> > > +	"Microsoft Corporation", "Amazon EC2"
> > > +};
> >
> > Maybe it is not concern at all, but this approach will make
> > forward/backward compatibility without kernel upgrade impossible.
> >
> > Once QEMU (example) will have needed support, someone will need to remove
> > the QEMU from this array, rewrite on_bare_metal() because it is not bare
> > vs. virtual anymore and require kernel upgrade/downgrade every time QEMU
> > version is switched.
> >
> > Plus need to update stable@ and distros.
> >
> > I'm already feeling pain from the fields while they debug such code.
> >
> > Am I missing it completely?
>
> The basic need here is that we want to disallow a brand new feature
> (device ims) to be enabled in any VMM environment.
>
> The cpuid (X86_FEATURE_HYPERVISOR) is a good choice, but it's optional
> and even not documented. So besides it, we maintain a block list
> (vmm_vendor_name) which lists all possible VMM vendor names. If
> dmi_match(DMI_SYS_VENDOR) hits, the new feature is not allowed to be
> enabled.
>
> This block list is a bit overkill since some vendor names could also be
> used on bare metal. We will delay enabling the new feature for those
> cases until we have a vendor-specific way to distinguish between bare
> metal and VMM environments.

I understand what this patch is doing.

>
> Honestly speaking, I can't see any compatible issue as it's common that
> a new feature is supported in a new kernel but not in an old one.

Yes, and we have common patterns for checking supported/not supported;
I'm sure that a hardcoded name table inside the kernel is not one of them.

I asked what you will do when QEMU gains the needed functionality.
Will you remove QEMU from this list? If yes, how will such a "new" kernel
work on old QEMU versions?

Thanks

>
> Best regards,
> baolu
Jason Gunthorpe Jan. 6, 2021, 3:23 p.m. UTC | #4
On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:

> I asked what will you do when QEMU will gain needed functionality?
> Will you remove QEMU from this list? If yes, how such "new" kernel will
> work on old QEMU versions?

The needed functionality is some VMM hypercall, so presumably new
kernels that support calling this hypercall will be able to discover
if the VMM hypercall exists and, if so, supersede this entire check.

Jason
Leon Romanovsky Jan. 6, 2021, 4:01 p.m. UTC | #5
On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
>
> > I asked what will you do when QEMU will gain needed functionality?
> > Will you remove QEMU from this list? If yes, how such "new" kernel will
> > work on old QEMU versions?
>
> The needed functionality is some VMM hypercall, so presumably new
> kernels that support calling this hypercall will be able to discover
> if the VMM hypercall exists and if so superceed this entire check.

Let's not speculate; do we have a well-known path?
Will such a patch be taken to stable@/distros?

Thanks

>
> Jason
Baolu Lu Jan. 7, 2021, 1:55 a.m. UTC | #6
On 1/7/21 12:01 AM, Leon Romanovsky wrote:
> On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
>> On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
>>
>>> I asked what will you do when QEMU will gain needed functionality?
>>> Will you remove QEMU from this list? If yes, how such "new" kernel will
>>> work on old QEMU versions?
>>
>> The needed functionality is some VMM hypercall, so presumably new
>> kernels that support calling this hypercall will be able to discover
>> if the VMM hypercall exists and if so superceed this entire check.
> 
> Let's not speculate, do we have well-known path?

All of this (hypercall detection and invocation) will be done in
pci_subdevice_msi_create_irq_domain(). It will be transparent to the
callers.
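
Condensed from the platform-msi.c hunk in the patch below, the gate sits
at the top of that function, so a caller only ever sees a NULL domain
when the platform cannot support IMS (sketch, body elided):

struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
						       struct msi_domain_info *info)
{
	/* Platform capability check (bare metal only, hypercall later). */
	if (!arch_support_pci_device_ims(pdev))
		return NULL;

	/* ... regular subdevice irq domain creation continues here ... */
}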

> Will such patch be taken to stable@/distros?

It will not be taken to stable. For distros, it depends. They can
backport it if they want to support the feature.

Best regards,
baolu
Tian, Kevin Jan. 7, 2021, 2:04 a.m. UTC | #7
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Thursday, January 7, 2021 12:02 AM
> 
> On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> >
> > > I asked what will you do when QEMU will gain needed functionality?
> > > Will you remove QEMU from this list? If yes, how such "new" kernel will
> > > work on old QEMU versions?
> >
> > The needed functionality is some VMM hypercall, so presumably new
> > kernels that support calling this hypercall will be able to discover
> > if the VMM hypercall exists and if so superceed this entire check.
> 
> Let's not speculate, do we have well-known path?
> Will such patch be taken to stable@/distros?
> 

There are two functions introduced in this patch. One detects whether we
are running on bare metal or in a virtual machine. The other decides
whether the platform supports IMS. Currently the two are identical
because IMS is supported only on bare metal at the current stage. In the
future, when IMS can be enabled in a VM, it will look like below:

bool arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return on_bare_metal() || hypercall_irq_domain_supported();
}

The VMM vendor list is for on_bare_metal(), and a vendor is never
expected to be removed once it has been added to the list, since the
fact of running in a VM never changes, regardless of whether that
hypervisor supports extra VMM hypercalls. hypercall_irq_domain_supported()
will actually detect, in a hypervisor-specific way, whether IMS can be
enabled in a VM (returning true only when a 'new' kernel runs on a 'new'
hypervisor). In this way no backporting is required when running a 'new'
kernel on an 'old' hypervisor.
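
For reference, the two pieces as they appear in this RFC, condensed from
the patch at the bottom of the page: a __weak default in platform-msi.c
and the x86 override which is bare-metal-only for now:

/* drivers/base/platform-msi.c: default when an arch has no opinion */
bool __weak arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return false;
}

/* arch/x86/pci/common.c: x86 override, bare metal only at this stage */
bool arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return on_bare_metal();
}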

Thanks
Kevin
Leon Romanovsky Jan. 7, 2021, 6:09 a.m. UTC | #8
On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Thursday, January 7, 2021 12:02 AM
> >
> > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> > >
> > > > I asked what will you do when QEMU will gain needed functionality?
> > > > Will you remove QEMU from this list? If yes, how such "new" kernel will
> > > > work on old QEMU versions?
> > >
> > > The needed functionality is some VMM hypercall, so presumably new
> > > kernels that support calling this hypercall will be able to discover
> > > if the VMM hypercall exists and if so superceed this entire check.
> >
> > Let's not speculate, do we have well-known path?
> > Will such patch be taken to stable@/distros?
> >
>
> There are two functions introduced in this patch. One is to detect whether
> running on bare metal or in a virtual machine. The other is for deciding
> whether the platform supports ims. Currently the two are identical because
> ims is supported only on bare metal at current stage. In the future it will look
> like below when ims can be enabled in a VM:
>
> bool arch_support_pci_device_ims(struct pci_dev *pdev)
> {
> 	return on_bare_metal() || hypercall_irq_domain_supported();
> }
>
> The VMM vendor list is for on_bare_metal, and suppose a vendor will
> never be removed once being added to the list since the fact of running
> in a VM never changes, regardless of whether this hypervisor supports
> extra VMM hypercalls.

This is what I imagined: this list will be there forever, and that
worries me.

I don't know if it is true or not, but I guess that at least Oracle and
Microsoft bare-metal machines and VMs will have the same DMI_SYS_VENDOR.

It means that this on_bare_metal() function won't work reliably in many
cases. Also, being part of include/linux/msi.h, at some point in time
this function will be picked up by other users for non-IMS cases.

I didn't even mention custom forks of QEMU which are prohibited from
changing DMI_SYS_VENDOR, or private clouds with custom solutions.

The current array turns the DMI_SYS_VENDOR interface into some sort of
ABI. If in the future QEMU decides to use a more hipster name, for
example "qEmU", this function won't work.

I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code, and
the various names for the same company are a good example of how
unreliable it is.

The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
Corporation/Dell Computer", but other companies are not far behind.

Luckily enough, this identification is used for hardware products that
were released to the market, and their names will be stable for a
specific model. That is not the case here, where we also need to ensure
future compatibility (an old kernel on a new VM emulator).

I'm not in a position to say yes or no to this patch and have no plans
to do so. I'm just expressing my feeling that this solution is too hacky
for my taste.

Thanks
Tian, Kevin Jan. 7, 2021, 6:55 a.m. UTC | #9
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Thursday, January 7, 2021 2:09 PM
> 
> On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Thursday, January 7, 2021 12:02 AM
> > >
> > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> > > >
> > > > > I asked what will you do when QEMU will gain needed functionality?
> > > > > Will you remove QEMU from this list? If yes, how such "new" kernel
> will
> > > > > work on old QEMU versions?
> > > >
> > > > The needed functionality is some VMM hypercall, so presumably new
> > > > kernels that support calling this hypercall will be able to discover
> > > > if the VMM hypercall exists and if so superceed this entire check.
> > >
> > > Let's not speculate, do we have well-known path?
> > > Will such patch be taken to stable@/distros?
> > >
> >
> > There are two functions introduced in this patch. One is to detect whether
> > running on bare metal or in a virtual machine. The other is for deciding
> > whether the platform supports ims. Currently the two are identical because
> > ims is supported only on bare metal at current stage. In the future it will
> look
> > like below when ims can be enabled in a VM:
> >
> > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > {
> > 	return on_bare_metal() || hypercall_irq_domain_supported();
> > }
> >
> > The VMM vendor list is for on_bare_metal, and suppose a vendor will
> > never be removed once being added to the list since the fact of running
> > in a VM never changes, regardless of whether this hypervisor supports
> > extra VMM hypercalls.
> 
> This is what I imagined, this list will be forever, and this worries me.
> 
> I don't know if it is true or not, but guess that at least Oracle and
> Microsoft bare metal devices and VMs will have same DMI_SYS_VENDOR.

It's true. David Woodhouse also said it's the case for Amazon EC2 instances.

> 
> It means that this on_bare_metal() function won't work reliably in many
> cases. Also being part of include/linux/msi.h, at some point of time,
> this function will be picked by the users outside for the non-IMS cases.
> 
> I didn't even mention custom forks of QEMU which are prohibited to change
> DMI_SYS_VENDOR and private clouds with custom solutions.

In this case the private QEMU forks are encouraged to set CPUID
(X86_FEATURE_HYPERVISOR) if they do plan to adopt a different vendor
name.

> 
> The current array makes DMI_SYS_VENDOR interface as some sort of ABI. If
> in the future,
> the QEMU will decide to use more hipster name, for example "qEmU", this
> function
> won't work.
> 
> I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code and
> various names for the same company are good example how not reliable it.
> 
> The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
> Corporation/Dell Computer",
> but other companies are not far from them.
> 
> Luckily enough, this identification is used for hardware product that
> was released to the market and their name will be stable for that
> specific model. It is not the case here where we need to ensure future
> compatibility too (old kernel on new VM emulator).
> 
> I'm not in position to say yes or no to this patch and don't have plans to do it.
> Just expressing my feeling that this solution is too hacky for my taste.
> 

I agree with your worries, and solely relying on DMI_SYS_VENDOR is
definitely too hacky. In previous discussions with Thomas, no elegant
way to handle this situation was found. It has to be a heuristic
approach. First, we hope the CPUID bit is set properly in most cases,
thus it is checked first. Then other heuristics can be added for the
remaining cases. DMI_SYS_VENDOR is the first hint and more can be added
later. For example, when an IOMMU is present there is a vendor-specific
way to detect whether it's real or virtual. Dave also mentioned some
BIOS flag to indicate a virtual machine. Now probably the real question
here is whether people are OK with the CPUID+DMI_SYS_VENDOR combo check
for now (and growing it later) or prefer having all of the identified
heuristics in place together...

Thanks
Kevin
Leon Romanovsky Jan. 7, 2021, 7:16 a.m. UTC | #10
On Thu, Jan 07, 2021 at 06:55:16AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Thursday, January 7, 2021 2:09 PM
> >
> > On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > > > From: Leon Romanovsky <leon@kernel.org>
> > > > Sent: Thursday, January 7, 2021 12:02 AM
> > > >
> > > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> > > > >
> > > > > > I asked what will you do when QEMU will gain needed functionality?
> > > > > > Will you remove QEMU from this list? If yes, how such "new" kernel
> > will
> > > > > > work on old QEMU versions?
> > > > >
> > > > > The needed functionality is some VMM hypercall, so presumably new
> > > > > kernels that support calling this hypercall will be able to discover
> > > > > if the VMM hypercall exists and if so superceed this entire check.
> > > >
> > > > Let's not speculate, do we have well-known path?
> > > > Will such patch be taken to stable@/distros?
> > > >
> > >
> > > There are two functions introduced in this patch. One is to detect whether
> > > running on bare metal or in a virtual machine. The other is for deciding
> > > whether the platform supports ims. Currently the two are identical because
> > > ims is supported only on bare metal at current stage. In the future it will
> > look
> > > like below when ims can be enabled in a VM:
> > >
> > > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > > {
> > > 	return on_bare_metal() || hypercall_irq_domain_supported();
> > > }
> > >
> > > The VMM vendor list is for on_bare_metal, and suppose a vendor will
> > > never be removed once being added to the list since the fact of running
> > > in a VM never changes, regardless of whether this hypervisor supports
> > > extra VMM hypercalls.
> >
> > This is what I imagined, this list will be forever, and this worries me.
> >
> > I don't know if it is true or not, but guess that at least Oracle and
> > Microsoft bare metal devices and VMs will have same DMI_SYS_VENDOR.
>
> It's true. David Woodhouse also said it's the case for Amazon EC2 instances.
>
> >
> > It means that this on_bare_metal() function won't work reliably in many
> > cases. Also being part of include/linux/msi.h, at some point of time,
> > this function will be picked by the users outside for the non-IMS cases.
> >
> > I didn't even mention custom forks of QEMU which are prohibited to change
> > DMI_SYS_VENDOR and private clouds with custom solutions.
>
> In this case the private QEMU forks are encouraged to set CPUID (X86_
> FEATURE_HYPERVISOR) if they do plan to adopt a different vendor name.

Does QEMU set this bit when it runs in host-passthrough CPU model?

>
> >
> > The current array makes DMI_SYS_VENDOR interface as some sort of ABI. If
> > in the future,
> > the QEMU will decide to use more hipster name, for example "qEmU", this
> > function
> > won't work.
> >
> > I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code and
> > various names for the same company are good example how not reliable it.
> >
> > The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
> > Corporation/Dell Computer",
> > but other companies are not far from them.
> >
> > Luckily enough, this identification is used for hardware product that
> > was released to the market and their name will be stable for that
> > specific model. It is not the case here where we need to ensure future
> > compatibility too (old kernel on new VM emulator).
> >
> > I'm not in position to say yes or no to this patch and don't have plans to do it.
> > Just expressing my feeling that this solution is too hacky for my taste.
> >
>
> I agree with your worries and solely relying on DMI_SYS_VENDOR is
> definitely too hacky. In previous discussions with Thomas there is no
> elegant way to handle this situation. It has to be a heuristic approach.
> First we hope the CPUID bit is set properly in most cases thus is checked
> first. Then other heuristics can be made for the remaining cases. DMI_
> SYS_VENDOR is the first hint and more can be added later. For example,
> when IOMMU is present there is vendor specific way to detect whether
> it's real or virtual. Dave also mentioned some BIOS flag to indicate a
> virtual machine. Now probably the real question here is whether people
> are OK with CPUID+DMI_SYS_VENDOR combo check for now (and grow
> it later) or prefer to having all identified heuristics so far in-place together...

IMHO, it should be as close as possible to the end result.

Thanks

>
> Thanks
> Kevin
Baolu Lu Jan. 12, 2021, 5:17 a.m. UTC | #11
Hi,

On 1/7/21 3:16 PM, Leon Romanovsky wrote:
> On Thu, Jan 07, 2021 at 06:55:16AM +0000, Tian, Kevin wrote:
>>> From: Leon Romanovsky <leon@kernel.org>
>>> Sent: Thursday, January 7, 2021 2:09 PM
>>>
>>> On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
>>>>> From: Leon Romanovsky <leon@kernel.org>
>>>>> Sent: Thursday, January 7, 2021 12:02 AM
>>>>>
>>>>> On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
>>>>>> On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
>>>>>>
>>>>>>> I asked what will you do when QEMU will gain needed functionality?
>>>>>>> Will you remove QEMU from this list? If yes, how such "new" kernel
>>> will
>>>>>>> work on old QEMU versions?
>>>>>>
>>>>>> The needed functionality is some VMM hypercall, so presumably new
>>>>>> kernels that support calling this hypercall will be able to discover
>>>>>> if the VMM hypercall exists and if so superceed this entire check.
>>>>>
>>>>> Let's not speculate, do we have well-known path?
>>>>> Will such patch be taken to stable@/distros?
>>>>>
>>>>
>>>> There are two functions introduced in this patch. One is to detect whether
>>>> running on bare metal or in a virtual machine. The other is for deciding
>>>> whether the platform supports ims. Currently the two are identical because
>>>> ims is supported only on bare metal at current stage. In the future it will
>>> look
>>>> like below when ims can be enabled in a VM:
>>>>
>>>> bool arch_support_pci_device_ims(struct pci_dev *pdev)
>>>> {
>>>> 	return on_bare_metal() || hypercall_irq_domain_supported();
>>>> }
>>>>
>>>> The VMM vendor list is for on_bare_metal, and suppose a vendor will
>>>> never be removed once being added to the list since the fact of running
>>>> in a VM never changes, regardless of whether this hypervisor supports
>>>> extra VMM hypercalls.
>>>
>>> This is what I imagined, this list will be forever, and this worries me.
>>>
>>> I don't know if it is true or not, but guess that at least Oracle and
>>> Microsoft bare metal devices and VMs will have same DMI_SYS_VENDOR.
>>
>> It's true. David Woodhouse also said it's the case for Amazon EC2 instances.
>>
>>>
>>> It means that this on_bare_metal() function won't work reliably in many
>>> cases. Also being part of include/linux/msi.h, at some point of time,
>>> this function will be picked by the users outside for the non-IMS cases.
>>>
>>> I didn't even mention custom forks of QEMU which are prohibited to change
>>> DMI_SYS_VENDOR and private clouds with custom solutions.
>>
>> In this case the private QEMU forks are encouraged to set CPUID (X86_
>> FEATURE_HYPERVISOR) if they do plan to adopt a different vendor name.
> 
> Does QEMU set this bit when it runs in host-passthrough CPU model?
> 
>>
>>>
>>> The current array makes DMI_SYS_VENDOR interface as some sort of ABI. If
>>> in the future,
>>> the QEMU will decide to use more hipster name, for example "qEmU", this
>>> function
>>> won't work.
>>>
>>> I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code and
>>> various names for the same company are good example how not reliable it.
>>>
>>> The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
>>> Corporation/Dell Computer",
>>> but other companies are not far from them.
>>>
>>> Luckily enough, this identification is used for hardware product that
>>> was released to the market and their name will be stable for that
>>> specific model. It is not the case here where we need to ensure future
>>> compatibility too (old kernel on new VM emulator).
>>>
>>> I'm not in position to say yes or no to this patch and don't have plans to do it.
>>> Just expressing my feeling that this solution is too hacky for my taste.
>>>
>>
>> I agree with your worries and solely relying on DMI_SYS_VENDOR is
>> definitely too hacky. In previous discussions with Thomas there is no
>> elegant way to handle this situation. It has to be a heuristic approach.
>> First we hope the CPUID bit is set properly in most cases thus is checked
>> first. Then other heuristics can be made for the remaining cases. DMI_
>> SYS_VENDOR is the first hint and more can be added later. For example,
>> when IOMMU is present there is vendor specific way to detect whether
>> it's real or virtual. Dave also mentioned some BIOS flag to indicate a
>> virtual machine. Now probably the real question here is whether people
>> are OK with CPUID+DMI_SYS_VENDOR combo check for now (and grow
>> it later) or prefer to having all identified heuristics so far in-place together...
> 
> IMHO, it should be as much as possible close to the end result.

Okay! This seems to be the right way to go.

SMBIOS defines a 'virtual machine' bit in the BIOS characteristics
extension byte. It could be used as one possible check.

In order to support an emulated IOMMU for fully virtualized guests, the
IOMMU vendors defined methods to distinguish between bare metal and a
VMM (caching mode in VT-d, for example).

I will go ahead and add the above two methods before checking the block
list.
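
A rough sketch of how the resulting order of checks could look. Note
that smbios_reports_vm() and vtd_caching_mode_set() are hypothetical
helper names standing in for the SMBIOS 'virtual machine' bit and the
VT-d caching-mode probes mentioned above; they are not existing kernel
APIs:

static bool on_bare_metal(void)
{
	int i;

	/* CPUID hypervisor bit, when the VMM sets it */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	/* Hypothetical: SMBIOS BIOS characteristics 'virtual machine' bit */
	if (smbios_reports_vm())
		return false;

	/* Hypothetical: IOMMU hint, e.g. VT-d caching mode set by a vIOMMU */
	if (vtd_caching_mode_set())
		return false;

	/* Last resort: the DMI_SYS_VENDOR block list */
	for (i = 0; i < ARRAY_SIZE(vmm_vendor_name); i++)
		if (dmi_match(DMI_SYS_VENDOR, vmm_vendor_name[i]))
			return false;

	return true;
}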

Best regards,
baolu
Leon Romanovsky Jan. 12, 2021, 5:53 a.m. UTC | #12
On Tue, Jan 12, 2021 at 01:17:11PM +0800, Lu Baolu wrote:
> Hi,
>
> On 1/7/21 3:16 PM, Leon Romanovsky wrote:
> > On Thu, Jan 07, 2021 at 06:55:16AM +0000, Tian, Kevin wrote:
> > > > From: Leon Romanovsky <leon@kernel.org>
> > > > Sent: Thursday, January 7, 2021 2:09 PM
> > > >
> > > > On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > > > > > From: Leon Romanovsky <leon@kernel.org>
> > > > > > Sent: Thursday, January 7, 2021 12:02 AM
> > > > > >
> > > > > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe wrote:
> > > > > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky wrote:
> > > > > > >
> > > > > > > > I asked what will you do when QEMU will gain needed functionality?
> > > > > > > > Will you remove QEMU from this list? If yes, how such "new" kernel
> > > > will
> > > > > > > > work on old QEMU versions?
> > > > > > >
> > > > > > > The needed functionality is some VMM hypercall, so presumably new
> > > > > > > kernels that support calling this hypercall will be able to discover
> > > > > > > if the VMM hypercall exists and if so superceed this entire check.
> > > > > >
> > > > > > Let's not speculate, do we have well-known path?
> > > > > > Will such patch be taken to stable@/distros?
> > > > > >
> > > > >
> > > > > There are two functions introduced in this patch. One is to detect whether
> > > > > running on bare metal or in a virtual machine. The other is for deciding
> > > > > whether the platform supports ims. Currently the two are identical because
> > > > > ims is supported only on bare metal at current stage. In the future it will
> > > > look
> > > > > like below when ims can be enabled in a VM:
> > > > >
> > > > > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > > > > {
> > > > > 	return on_bare_metal() || hypercall_irq_domain_supported();
> > > > > }
> > > > >
> > > > > The VMM vendor list is for on_bare_metal, and suppose a vendor will
> > > > > never be removed once being added to the list since the fact of running
> > > > > in a VM never changes, regardless of whether this hypervisor supports
> > > > > extra VMM hypercalls.
> > > >
> > > > This is what I imagined, this list will be forever, and this worries me.
> > > >
> > > > I don't know if it is true or not, but guess that at least Oracle and
> > > > Microsoft bare metal devices and VMs will have same DMI_SYS_VENDOR.
> > >
> > > It's true. David Woodhouse also said it's the case for Amazon EC2 instances.
> > >
> > > >
> > > > It means that this on_bare_metal() function won't work reliably in many
> > > > cases. Also being part of include/linux/msi.h, at some point of time,
> > > > this function will be picked by the users outside for the non-IMS cases.
> > > >
> > > > I didn't even mention custom forks of QEMU which are prohibited to change
> > > > DMI_SYS_VENDOR and private clouds with custom solutions.
> > >
> > > In this case the private QEMU forks are encouraged to set CPUID (X86_
> > > FEATURE_HYPERVISOR) if they do plan to adopt a different vendor name.
> >
> > Does QEMU set this bit when it runs in host-passthrough CPU model?
> >
> > >
> > > >
> > > > The current array makes DMI_SYS_VENDOR interface as some sort of ABI. If
> > > > in the future,
> > > > the QEMU will decide to use more hipster name, for example "qEmU", this
> > > > function
> > > > won't work.
> > > >
> > > > I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code and
> > > > various names for the same company are good example how not reliable it.
> > > >
> > > > The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
> > > > Corporation/Dell Computer",
> > > > but other companies are not far from them.
> > > >
> > > > Luckily enough, this identification is used for hardware product that
> > > > was released to the market and their name will be stable for that
> > > > specific model. It is not the case here where we need to ensure future
> > > > compatibility too (old kernel on new VM emulator).
> > > >
> > > > I'm not in position to say yes or no to this patch and don't have plans to do it.
> > > > Just expressing my feeling that this solution is too hacky for my taste.
> > > >
> > >
> > > I agree with your worries and solely relying on DMI_SYS_VENDOR is
> > > definitely too hacky. In previous discussions with Thomas there is no
> > > elegant way to handle this situation. It has to be a heuristic approach.
> > > First we hope the CPUID bit is set properly in most cases thus is checked
> > > first. Then other heuristics can be made for the remaining cases. DMI_
> > > SYS_VENDOR is the first hint and more can be added later. For example,
> > > when IOMMU is present there is vendor specific way to detect whether
> > > it's real or virtual. Dave also mentioned some BIOS flag to indicate a
> > > virtual machine. Now probably the real question here is whether people
> > > are OK with CPUID+DMI_SYS_VENDOR combo check for now (and grow
> > > it later) or prefer to having all identified heuristics so far in-place together...
> >
> > IMHO, it should be as much as possible close to the end result.
>
> Okay! This seems to be a right way to go.
>
> The SMBIOS defines a 'virtual machine' bit in the BIOS characteristics
> extension byte. It could be used as a possible way.
>
> In order to support emulated IOMMU for fully virtualized guest, the
> iommu vendors defined methods to distinguish between bare metal and VMM
> (caching mode in VT-d for example).
>
> I will go ahead with adding above two methods before checking the block
> list.

I am still curious to hear an answer to my question above:
"Does QEMU set this bit when it runs in host-passthrough CPU model?"

In this mode QEMU will use the host CPU without modifications.
Will it be marked as bare metal or as a VMM?

Thanks

>
> Best regards,
> baolu
Tian, Kevin Jan. 12, 2021, 6:59 a.m. UTC | #13
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, January 12, 2021 1:53 PM
> 
> On Tue, Jan 12, 2021 at 01:17:11PM +0800, Lu Baolu wrote:
> > Hi,
> >
> > On 1/7/21 3:16 PM, Leon Romanovsky wrote:
> > > On Thu, Jan 07, 2021 at 06:55:16AM +0000, Tian, Kevin wrote:
> > > > > From: Leon Romanovsky <leon@kernel.org>
> > > > > Sent: Thursday, January 7, 2021 2:09 PM
> > > > >
> > > > > On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > > > > > > From: Leon Romanovsky <leon@kernel.org>
> > > > > > > Sent: Thursday, January 7, 2021 12:02 AM
> > > > > > >
> > > > > > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe
> wrote:
> > > > > > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky
> wrote:
> > > > > > > >
> > > > > > > > > I asked what will you do when QEMU will gain needed
> functionality?
> > > > > > > > > Will you remove QEMU from this list? If yes, how such "new"
> kernel
> > > > > will
> > > > > > > > > work on old QEMU versions?
> > > > > > > >
> > > > > > > > The needed functionality is some VMM hypercall, so presumably
> new
> > > > > > > > kernels that support calling this hypercall will be able to discover
> > > > > > > > if the VMM hypercall exists and if so superceed this entire check.
> > > > > > >
> > > > > > > Let's not speculate, do we have well-known path?
> > > > > > > Will such patch be taken to stable@/distros?
> > > > > > >
> > > > > >
> > > > > > There are two functions introduced in this patch. One is to detect
> whether
> > > > > > running on bare metal or in a virtual machine. The other is for
> deciding
> > > > > > whether the platform supports ims. Currently the two are identical
> because
> > > > > > ims is supported only on bare metal at current stage. In the future it
> will
> > > > > look
> > > > > > like below when ims can be enabled in a VM:
> > > > > >
> > > > > > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > > > > > {
> > > > > > 	return on_bare_metal() ||
> hypercall_irq_domain_supported();
> > > > > > }
> > > > > >
> > > > > > The VMM vendor list is for on_bare_metal, and suppose a vendor
> will
> > > > > > never be removed once being added to the list since the fact of
> running
> > > > > > in a VM never changes, regardless of whether this hypervisor
> supports
> > > > > > extra VMM hypercalls.
> > > > >
> > > > > This is what I imagined, this list will be forever, and this worries me.
> > > > >
> > > > > I don't know if it is true or not, but guess that at least Oracle and
> > > > > Microsoft bare metal devices and VMs will have same
> DMI_SYS_VENDOR.
> > > >
> > > > It's true. David Woodhouse also said it's the case for Amazon EC2
> instances.
> > > >
> > > > >
> > > > > It means that this on_bare_metal() function won't work reliably in
> many
> > > > > cases. Also being part of include/linux/msi.h, at some point of time,
> > > > > this function will be picked by the users outside for the non-IMS cases.
> > > > >
> > > > > I didn't even mention custom forks of QEMU which are prohibited to
> change
> > > > > DMI_SYS_VENDOR and private clouds with custom solutions.
> > > >
> > > > In this case the private QEMU forks are encouraged to set CPUID (X86_
> > > > FEATURE_HYPERVISOR) if they do plan to adopt a different vendor
> name.
> > >
> > > Does QEMU set this bit when it runs in host-passthrough CPU model?
> > >
> > > >
> > > > >
> > > > > The current array makes DMI_SYS_VENDOR interface as some sort of
> ABI. If
> > > > > in the future,
> > > > > the QEMU will decide to use more hipster name, for example "qEmU",
> this
> > > > > function
> > > > > won't work.
> > > > >
> > > > > I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code
> and
> > > > > various names for the same company are good example how not
> reliable it.
> > > > >
> > > > > The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
> > > > > Corporation/Dell Computer",
> > > > > but other companies are not far from them.
> > > > >
> > > > > Luckily enough, this identification is used for hardware product that
> > > > > was released to the market and their name will be stable for that
> > > > > specific model. It is not the case here where we need to ensure future
> > > > > compatibility too (old kernel on new VM emulator).
> > > > >
> > > > > I'm not in position to say yes or no to this patch and don't have plans
> to do it.
> > > > > Just expressing my feeling that this solution is too hacky for my taste.
> > > > >
> > > >
> > > > I agree with your worries and solely relying on DMI_SYS_VENDOR is
> > > > definitely too hacky. In previous discussions with Thomas there is no
> > > > elegant way to handle this situation. It has to be a heuristic approach.
> > > > First we hope the CPUID bit is set properly in most cases thus is checked
> > > > first. Then other heuristics can be made for the remaining cases. DMI_
> > > > SYS_VENDOR is the first hint and more can be added later. For example,
> > > > when IOMMU is present there is vendor specific way to detect whether
> > > > it's real or virtual. Dave also mentioned some BIOS flag to indicate a
> > > > virtual machine. Now probably the real question here is whether people
> > > > are OK with CPUID+DMI_SYS_VENDOR combo check for now (and grow
> > > > it later) or prefer to having all identified heuristics so far in-place
> together...
> > >
> > > IMHO, it should be as much as possible close to the end result.
> >
> > Okay! This seems to be a right way to go.
> >
> > The SMBIOS defines a 'virtual machine' bit in the BIOS characteristics
> > extension byte. It could be used as a possible way.
> >
> > In order to support emulated IOMMU for fully virtualized guest, the
> > iommu vendors defined methods to distinguish between bare metal and
> VMM
> > (caching mode in VT-d for example).
> >
> > I will go ahead with adding above two methods before checking the block
> > list.
> 
> I still curious to hear an answer on my question above:
> "Does QEMU set this bit when it runs in host-passthrough CPU model?"

Yes, the bit is also set in this model.

Thanks
Kevin
Leon Romanovsky Jan. 12, 2021, 7:34 a.m. UTC | #14
On Tue, Jan 12, 2021 at 06:59:35AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, January 12, 2021 1:53 PM
> >
> > On Tue, Jan 12, 2021 at 01:17:11PM +0800, Lu Baolu wrote:
> > > Hi,
> > >
> > > On 1/7/21 3:16 PM, Leon Romanovsky wrote:
> > > > On Thu, Jan 07, 2021 at 06:55:16AM +0000, Tian, Kevin wrote:
> > > > > > From: Leon Romanovsky <leon@kernel.org>
> > > > > > Sent: Thursday, January 7, 2021 2:09 PM
> > > > > >
> > > > > > On Thu, Jan 07, 2021 at 02:04:29AM +0000, Tian, Kevin wrote:
> > > > > > > > From: Leon Romanovsky <leon@kernel.org>
> > > > > > > > Sent: Thursday, January 7, 2021 12:02 AM
> > > > > > > >
> > > > > > > > On Wed, Jan 06, 2021 at 11:23:39AM -0400, Jason Gunthorpe
> > wrote:
> > > > > > > > > On Wed, Jan 06, 2021 at 12:40:17PM +0200, Leon Romanovsky
> > wrote:
> > > > > > > > >
> > > > > > > > > > I asked what will you do when QEMU will gain needed
> > functionality?
> > > > > > > > > > Will you remove QEMU from this list? If yes, how such "new"
> > kernel
> > > > > > will
> > > > > > > > > > work on old QEMU versions?
> > > > > > > > >
> > > > > > > > > The needed functionality is some VMM hypercall, so presumably
> > new
> > > > > > > > > kernels that support calling this hypercall will be able to discover
> > > > > > > > > if the VMM hypercall exists and if so superceed this entire check.
> > > > > > > >
> > > > > > > > Let's not speculate, do we have well-known path?
> > > > > > > > Will such patch be taken to stable@/distros?
> > > > > > > >
> > > > > > >
> > > > > > > There are two functions introduced in this patch. One is to detect
> > whether
> > > > > > > running on bare metal or in a virtual machine. The other is for
> > deciding
> > > > > > > whether the platform supports ims. Currently the two are identical
> > because
> > > > > > > ims is supported only on bare metal at current stage. In the future it
> > will
> > > > > > look
> > > > > > > like below when ims can be enabled in a VM:
> > > > > > >
> > > > > > > bool arch_support_pci_device_ims(struct pci_dev *pdev)
> > > > > > > {
> > > > > > > 	return on_bare_metal() ||
> > hypercall_irq_domain_supported();
> > > > > > > }
> > > > > > >
> > > > > > > The VMM vendor list is for on_bare_metal, and suppose a vendor
> > will
> > > > > > > never be removed once being added to the list since the fact of
> > running
> > > > > > > in a VM never changes, regardless of whether this hypervisor
> > supports
> > > > > > > extra VMM hypercalls.
> > > > > >
> > > > > > This is what I imagined, this list will be forever, and this worries me.
> > > > > >
> > > > > > I don't know if it is true or not, but guess that at least Oracle and
> > > > > > Microsoft bare metal devices and VMs will have same
> > DMI_SYS_VENDOR.
> > > > >
> > > > > It's true. David Woodhouse also said it's the case for Amazon EC2
> > instances.
> > > > >
> > > > > >
> > > > > > It means that this on_bare_metal() function won't work reliably in
> > many
> > > > > > cases. Also being part of include/linux/msi.h, at some point of time,
> > > > > > this function will be picked by the users outside for the non-IMS cases.
> > > > > >
> > > > > > I didn't even mention custom forks of QEMU which are prohibited to
> > change
> > > > > > DMI_SYS_VENDOR and private clouds with custom solutions.
> > > > >
> > > > > In this case the private QEMU forks are encouraged to set CPUID (X86_
> > > > > FEATURE_HYPERVISOR) if they do plan to adopt a different vendor
> > name.
> > > >
> > > > Does QEMU set this bit when it runs in host-passthrough CPU model?
> > > >
> > > > >
> > > > > >
> > > > > > The current array makes DMI_SYS_VENDOR interface as some sort of
> > ABI. If
> > > > > > in the future,
> > > > > > the QEMU will decide to use more hipster name, for example "qEmU",
> > this
> > > > > > function
> > > > > > won't work.
> > > > > >
> > > > > > I'm aware that DMI_SYS_VENDOR is used heavily in the kernel code
> > and
> > > > > > various names for the same company are good example how not
> > reliable it.
> > > > > >
> > > > > > The most hilarious example is "Dell/Dell Inc./Dell Inc/Dell Computer
> > > > > > Corporation/Dell Computer",
> > > > > > but other companies are not far from them.
> > > > > >
> > > > > > Luckily enough, this identification is used for hardware product that
> > > > > > was released to the market and their name will be stable for that
> > > > > > specific model. It is not the case here where we need to ensure future
> > > > > > compatibility too (old kernel on new VM emulator).
> > > > > >
> > > > > > I'm not in position to say yes or no to this patch and don't have plans
> > to do it.
> > > > > > Just expressing my feeling that this solution is too hacky for my taste.
> > > > > >
> > > > >
> > > > > I agree with your worries and solely relying on DMI_SYS_VENDOR is
> > > > > definitely too hacky. In previous discussions with Thomas there is no
> > > > > elegant way to handle this situation. It has to be a heuristic approach.
> > > > > First we hope the CPUID bit is set properly in most cases thus is checked
> > > > > first. Then other heuristics can be made for the remaining cases. DMI_
> > > > > SYS_VENDOR is the first hint and more can be added later. For example,
> > > > > when IOMMU is present there is vendor specific way to detect whether
> > > > > it's real or virtual. Dave also mentioned some BIOS flag to indicate a
> > > > > virtual machine. Now probably the real question here is whether people
> > > > > are OK with CPUID+DMI_SYS_VENDOR combo check for now (and grow
> > > > > it later) or prefer to having all identified heuristics so far in-place
> > together...
> > > >
> > > > IMHO, it should be as much as possible close to the end result.
> > >
> > > Okay! This seems to be a right way to go.
> > >
> > > The SMBIOS defines a 'virtual machine' bit in the BIOS characteristics
> > > extension byte. It could be used as a possible way.
> > >
> > > In order to support emulated IOMMU for fully virtualized guest, the
> > > iommu vendors defined methods to distinguish between bare metal and
> > VMM
> > > (caching mode in VT-d for example).
> > >
> > > I will go ahead with adding above two methods before checking the block
> > > list.
> >
> > I still curious to hear an answer on my question above:
> > "Does QEMU set this bit when it runs in host-passthrough CPU model?"
>
> Yes, the bit is also set in this model.

Great, thanks.

>
> Thanks
> Kevin

Patch

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f456fcd0..963e0401f2b2 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -724,3 +724,50 @@  struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
 	return dev;
 }
 #endif
+
+/*
+ * We want to figure out which context we are running in. But the hardware
+ * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
+ * which can be manipulated by the VMM to let the OS figure out where it runs.
+ * So we go with the below probably on_bare_metal() function as a replacement
+ * for definitely on_bare_metal() to go forward only for the very simple reason
+ * that this is the only option we have.
+ *
+ * People might use the same vendor name for both bare metal and virtual
+ * environment. We can remove those names once we have vendor specific code to
+ * distinguish between them.
+ */
+static const char * const vmm_vendor_name[] = {
+	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
+	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
+	"Microsoft Corporation", "Amazon EC2"
+};
+
+static bool on_bare_metal(void)
+{
+	int i;
+
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return false;
+
+	for (i = 0; i < ARRAY_SIZE(vmm_vendor_name); i++)
+		if (dmi_match(DMI_SYS_VENDOR, vmm_vendor_name[i]))
+			return false;
+
+	pr_info("System running on bare metal, report to bugzilla.kernel.org if not the case.");
+
+	return true;
+}
+
+bool arch_support_pci_device_ims(struct pci_dev *pdev)
+{
+	/*
+	 * When we are running in a VMM context, the device IMS could only be
+	 * enabled when the underlying hardware supports interrupt isolation
+	 * of the subdevice, or any mechanism (trap, hypercall) is added so
+	 * that changes in the interrupt message store could be managed by the
+	 * VMM. For now, we only support the device IMS when we are running on
+	 * the bare metal.
+	 */
+	return on_bare_metal();
+}
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 8432a1bf4e28..88e5fe4dae67 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -512,6 +512,11 @@  struct irq_domain *device_msi_create_irq_domain(struct fwnode_handle *fn,
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
 
+bool __weak arch_support_pci_device_ims(struct pci_dev *pdev)
+{
+	return false;
+}
+
 /**
  * pci_subdevice_msi_create_irq_domain - Create an irq domain for subdevices
  * @pdev:	Pointer to PCI device for which the subdevice domain is created
@@ -523,6 +528,9 @@  struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
 	struct irq_domain *domain, *pdev_msi;
 	struct fwnode_handle *fn;
 
+	if (!arch_support_pci_device_ims(pdev))
+		return NULL;
+
 	/*
 	 * Retrieve the MSI domain of the underlying PCI device's MSI
 	 * domain. The PCI device domain's parent domain is also the parent
diff --git a/include/linux/msi.h b/include/linux/msi.h
index f319d7c6a4ef..6fda81c4b859 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -478,6 +478,7 @@  struct irq_domain *device_msi_create_irq_domain(struct fwnode_handle *fn,
 						struct irq_domain *parent);
 
 # ifdef CONFIG_PCI
+bool arch_support_pci_device_ims(struct pci_dev *pdev);
 struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
 						       struct msi_domain_info *info);
 # endif