diff mbox series

[v5,06/14] x86/ioremap: Support hypervisor specified range to map as encrypted

Message ID 1673559753-94403-7-git-send-email-mikelley@microsoft.com (mailing list archive)
State Not Applicable
Headers show
Series Add PCI pass-thru support to Hyper-V Confidential VMs | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Michael Kelley (LINUX) Jan. 12, 2023, 9:42 p.m. UTC
In a AMD SEV-SNP VM using vTOM, devices in MMIO space may be provided by
the paravisor and need to be mapped as encrypted.  Provide a function
for the hypervisor to specify the address range for such devices.
In __ioremap_caller(), map addresses in this range as encrypted.

Only a single range is supported. If multiple devices need to be
mapped encrypted, the paravisor must place them within the single
contiguous range.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---
 arch/x86/include/asm/io.h |  2 ++
 arch/x86/mm/ioremap.c     | 27 ++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

Comments

Borislav Petkov Jan. 20, 2023, 8:15 p.m. UTC | #1
On Thu, Jan 12, 2023 at 01:42:25PM -0800, Michael Kelley wrote:
> In a AMD SEV-SNP VM using vTOM, devices in MMIO space may be provided by
> the paravisor and need to be mapped as encrypted.  Provide a function
> for the hypervisor to specify the address range for such devices.
> In __ioremap_caller(), map addresses in this range as encrypted.
> 
> Only a single range is supported. If multiple devices need to be
> mapped encrypted, the paravisor must place them within the single
> contiguous range.

This already is starting to sound insufficient and hacky. And it also makes
CC_ATTR_ACCESS_IOAPIC_ENCRYPTED insufficient either.

So, the situation we have is, we're a SEV-SNP VM using vTOM. Which means,
MSR_AMD64_SEV[3] = 1. Or SEV_FEATURES[1], alternatively - same thing.

That MSR cannot be intercepted by the HV and we use it extensively in Linux when
it runs as a SEV-* guest. And I had asked this before, during review, but why
aren't you checking this bit above when you wanna do vTOM-specific work?

Because then you can do that check and

1. map the IO-APIC encrypted
2. map MMIO space of devices from the driver encrypted too
3. ...

and so on.

And you won't need those other, not as nice things...

Hmmm.
Michael Kelley (LINUX) Jan. 21, 2023, 4:10 a.m. UTC | #2
From: Borislav Petkov <bp@alien8.de> Sent: Friday, January 20, 2023 12:15 PM
> 
> On Thu, Jan 12, 2023 at 01:42:25PM -0800, Michael Kelley wrote:
> > In a AMD SEV-SNP VM using vTOM, devices in MMIO space may be provided by
> > the paravisor and need to be mapped as encrypted.  Provide a function
> > for the hypervisor to specify the address range for such devices.
> > In __ioremap_caller(), map addresses in this range as encrypted.
> >
> > Only a single range is supported. If multiple devices need to be
> > mapped encrypted, the paravisor must place them within the single
> > contiguous range.
> 
> This already is starting to sound insufficient and hacky. And it also makes
> CC_ATTR_ACCESS_IOAPIC_ENCRYPTED insufficient either.
> 
> So, the situation we have is, we're a SEV-SNP VM using vTOM. Which means,
> MSR_AMD64_SEV[3] = 1. Or SEV_FEATURES[1], alternatively - same thing.
> 
> That MSR cannot be intercepted by the HV and we use it extensively in Linux when
> it runs as a SEV-* guest. And I had asked this before, during review, but why
> aren't you checking this bit above when you wanna do vTOM-specific work?

Per the commit message for 009767dbf42a, it's not safe to read MSR_AMD64_SEV
on all implementations of AMD processors.  CC_ATTR_ACCESS_IOAPIC_ENCRYPTED
is used in io_apic_set_fixmap(), which is called on all Intel/AMD systems without
any qualifications.   Even if rdmsrl_safe() works at this point in boot, I'm a little
leery of reading a new MSR in a path that essentially every x86 bare-metal system
or VM takes during boot.  Or am I being overly paranoid about old processor
models or hypervisor versions potentially doing something weird?

But in any case, the whole point of cc_platform_has() is to provide a level of
abstraction from the hardware registers, and it's fully safe to use on every x86
bare-metal system or VM.  And while I don't anticipate it now, maybe there's
some future scheme where a paravisor-like entity could be used with Intel
TDX.  It seems like using a cc_platform_has() abstraction is better than directly
accessing the MSR.

> 
> Because then you can do that check and
> 
> 1. map the IO-APIC encrypted
> 2. map MMIO space of devices from the driver encrypted too
> 3. ...

My resolution of the TPM driver issue is admittedly a work-around.   I think
of it as temporary in anticipation of future implementations of PCIe TDISP
hardware, which allows PCI devices to DMA directly into guest encrypted
memory.  TDISP also places the device's BAR values in an encrypted portion
of the GPA space. Assuming TDISP hardware comes along in the next couple
of years, Linux will need a robust way to deal with a mix of PCI devices
being in unencrypted and encrypted GPA space.  I don't know how a
specific device will be mapped correctly, but I hope it can happen in the
generic PCI code, and not by modifying each device driver.  It's probably
premature to build that robust mechanism now, but when it comes, my
work-around would be replaced.
 
With all that in mind, I don't want to modify the TPM driver to special-case
its MMIO space being encrypted.  FWIW, the TPM driver today uses
devm_ioremap_resource() to do the mapping, which defaults to mapping
decrypted except for the exceptions implemented in __ioremap_caller().
There's no devm_* option for specifying encrypted.  Handling decrypted
vs. encrypted in the driver would require extending the driver API to
provide an "encrypted" option, and that seems like going in the wrong
long-term direction.

Finally, I would have liked to handle the IO-APIC and the vTPM the same
way.  But the IO-APIC doesn't use the normal ioremap_* functions because
it's done early in boot via the fixmap.  I didn't see a reasonable way to
unify them.

Anyway, that's my thoughts.  I'm certainly open if someone has a better
way to do it.

Michael

> 
> and so on.
> 
> And you won't need those other, not as nice things...
> 
> Hmmm.
Borislav Petkov Jan. 25, 2023, 2:55 p.m. UTC | #3
On Sat, Jan 21, 2023 at 04:10:23AM +0000, Michael Kelley (LINUX) wrote:
> Per the commit message for 009767dbf42a, it's not safe to read MSR_AMD64_SEV
> on all implementations of AMD processors.

1. Why does that matter to you? This is Hygon.

2. The MSR access is behind CPUID check.

> CC_ATTR_ACCESS_IOAPIC_ENCRYPTED is used in io_apic_set_fixmap(), which is
> called on all Intel/AMD systems without any qualifications.   Even if
> rdmsrl_safe() works at this point in boot, I'm a little leery of reading a new
> MSR in a path that essentially every x86 bare-metal system or VM takes during
> boot.

You read the MSR once and cache it. sme_enable() already does that. I don't see
what the problem is.

> Or am I being overly paranoid about old processor models or hypervisor
> versions potentially doing something weird?

Yes, you are. :)

If they're doing something weird which is out of spec, then we'll deal with them
later.

> But in any case, the whole point of cc_platform_has() is to provide a level of
> abstraction from the hardware registers, and it's fully safe to use on every x86
> bare-metal system or VM.  And while I don't anticipate it now, maybe there's
> some future scheme where a paravisor-like entity could be used with Intel
> TDX.  It seems like using a cc_platform_has() abstraction is better than directly
> accessing the MSR.

That's fine but we're talking about this particular implementation and that is
vTOM-like with the address space split. If TDX does address space split later,
we can accomodate it too. (Although I think they are not interested in this).

And if you really want to use cc_platform_has(), we could do

	cc_platform_has(CC_ADDRESS_SPACE_SPLIT_ON_A_PARAVISOR)

or something with a better name.

The point is, you want to do things which are specific for this particular
implementation. If so, then check for it. Do not do hacky things which get the
work done for your case but can very easily be misused by others.

> My resolution of the TPM driver issue is admittedly a work-around.   I think
> of it as temporary in anticipation of future implementations of PCIe TDISP
> hardware, which allows PCI devices to DMA directly into guest encrypted
> memory.

Yap, that sounds real nice.

> TDISP also places the device's BAR values in an encrypted portion
> of the GPA space. Assuming TDISP hardware comes along in the next couple
> of years, Linux will need a robust way to deal with a mix of PCI devices
> being in unencrypted and encrypted GPA space.  I don't know how a
> specific device will be mapped correctly, but I hope it can happen in the
> generic PCI code, and not by modifying each device driver.

I guess those devices would advertize that capability somehow so that code can
query it and act accordingly.

> It's probably premature to build that robust mechanism now, but when it comes,
> my work-around would be replaced.

It would be replaced if it doesn't have any users. By the looks of it, it'll
soon grow others and then good luck removing it.

> With all that in mind, I don't want to modify the TPM driver to special-case
> its MMIO space being encrypted.  FWIW, the TPM driver today uses
> devm_ioremap_resource() to do the mapping, which defaults to mapping
> decrypted except for the exceptions implemented in __ioremap_caller().
> There's no devm_* option for specifying encrypted.

You mean, it is hard to add a DEVM_IOREMAP_ENCRYPTED type which will have
__devm_ioremap() call ioremap_encrypted()?

Or define a IORESOURCE_ENCRYPTED and pass it through the ioresource flags?

Why is that TPM driver so precious that it can be touched and the arch code
would have to accept hacks?

> Handling decrypted vs. encrypted in the driver would require extending the
> driver API to provide an "encrypted" option, and that seems like going in the
> wrong long-term direction.

Sorry, I can't follow here.
Michael Kelley (LINUX) Feb. 2, 2023, 5:49 a.m. UTC | #4
From: Borislav Petkov <bp@alien8.de> Sent: Wednesday, January 25, 2023 6:56 AM
> 
> On Sat, Jan 21, 2023 at 04:10:23AM +0000, Michael Kelley (LINUX) wrote:

[snip]

> 
> > But in any case, the whole point of cc_platform_has() is to provide a level of
> > abstraction from the hardware registers, and it's fully safe to use on every x86
> > bare-metal system or VM.  And while I don't anticipate it now, maybe there's
> > some future scheme where a paravisor-like entity could be used with Intel
> > TDX.  It seems like using a cc_platform_has() abstraction is better than directly
> > accessing the MSR.
> 
> That's fine but we're talking about this particular implementation and that is
> vTOM-like with the address space split. If TDX does address space split later,
> we can accomodate it too. (Although I think they are not interested in this).
> 
> And if you really want to use cc_platform_has(), we could do
> 
> 	cc_platform_has(CC_ADDRESS_SPACE_SPLIT_ON_A_PARAVISOR)
> 
> or something with a better name.

I do think it makes sense to use the cc_platform_has() abstraction.  It's
then a question of agreeing on how to name the attribute.  We've
discussed various approaches in different versions of this patch series:

v1 & v2:  CC_ATTR_HAS_PARAVISOR
v3:  CC_ATTR_EMULATED_IOAPIC
v4 & v5:  CC_ATTR_ACCESS_IOAPIC_ENCRYPTED

I could do:
1.  CC_ATTR_PARAVISOR_SPLIT_ADDRESS_SPACE, which is similar to
    what I had for v1 & v2.   At the time, somebody commented that
    this might be a bit too general.
2.  Keep CC_ATTR_ACCESS_IOAPIC_ENCRYPTED and add
    CC_ATTR_ACCESS_TPM_ENCRYPTED, which would decouple them
3.  CC_ATTR_ACCESS_IOAPIC_AND_TPM_ENCRYPTED, which is very
    narrow and specific.

I have weak preference for #1 above, but I could go with any of them.
What's your preference?

> > My resolution of the TPM driver issue is admittedly a work-around.   I think
> > of it as temporary in anticipation of future implementations of PCIe TDISP
> > hardware, which allows PCI devices to DMA directly into guest encrypted
> > memory.
> 
> Yap, that sounds real nice.
> 
> > TDISP also places the device's BAR values in an encrypted portion
> > of the GPA space. Assuming TDISP hardware comes along in the next couple
> > of years, Linux will need a robust way to deal with a mix of PCI devices
> > being in unencrypted and encrypted GPA space.  I don't know how a
> > specific device will be mapped correctly, but I hope it can happen in the
> > generic PCI code, and not by modifying each device driver.
> 
> I guess those devices would advertize that capability somehow so that code can
> query it and act accordingly.
> 
> > It's probably premature to build that robust mechanism now, but when it comes,
> > my work-around would be replaced.
> 
> It would be replaced if it doesn't have any users. By the looks of it, it'll
> soon grow others and then good luck removing it.
> 
> > With all that in mind, I don't want to modify the TPM driver to special-case
> > its MMIO space being encrypted.  FWIW, the TPM driver today uses
> > devm_ioremap_resource() to do the mapping, which defaults to mapping
> > decrypted except for the exceptions implemented in __ioremap_caller().
> > There's no devm_* option for specifying encrypted.
> 
> You mean, it is hard to add a DEVM_IOREMAP_ENCRYPTED type which will have
> __devm_ioremap() call ioremap_encrypted()?
> 
> Or define a IORESOURCE_ENCRYPTED and pass it through the ioresource flags?
> 
> Why is that TPM driver so precious that it can be touched and the arch code
> would have to accept hacks?
> 
> > Handling decrypted vs. encrypted in the driver would require extending the
> > driver API to provide an "encrypted" option, and that seems like going in the
> > wrong long-term direction.
> 
> Sorry, I can't follow here.
> 

For v6 of the patch series, I've coded devm_ioremap_resource_enc() to call
__devm_ioremap(), which then calls ioremap_encrypted().  I've updated the
TPM driver to use cc_platform_has() with whatever attribute name we agree
on to decide between devm_ioremap_resource_enc() and
devm_ioremap_resource().

If this approach is OK with the TPM driver maintainers, I'm good with it.
More robust handling of a mix of encrypted and decrypted devices can get
sorted out later.

Michael
Borislav Petkov Feb. 7, 2023, 12:41 p.m. UTC | #5
On Thu, Feb 02, 2023 at 05:49:44AM +0000, Michael Kelley (LINUX) wrote:
> I could do:
> 1.  CC_ATTR_PARAVISOR_SPLIT_ADDRESS_SPACE, which is similar to
>     what I had for v1 & v2.   At the time, somebody commented that
>     this might be a bit too general.
> 2.  Keep CC_ATTR_ACCESS_IOAPIC_ENCRYPTED and add
>     CC_ATTR_ACCESS_TPM_ENCRYPTED, which would decouple them
> 3.  CC_ATTR_ACCESS_IOAPIC_AND_TPM_ENCRYPTED, which is very
>     narrow and specific.
> 
> I have weak preference for #1 above, but I could go with any of them.
> What's your preference?

Either 1. but a shorter name or something which works with the TDX side
too.

Or are there no similar TDX solutions planned where the guest runs
unmodified and under a paravisor?

> For v6 of the patch series, I've coded devm_ioremap_resource_enc() to call
> __devm_ioremap(), which then calls ioremap_encrypted().  I've updated the
> TPM driver to use cc_platform_has() with whatever attribute name we agree
> on to decide between devm_ioremap_resource_enc() and
> devm_ioremap_resource().
> 
> If this approach is OK with the TPM driver maintainers, I'm good with it.
> More robust handling of a mix of encrypted and decrypted devices can get
> sorted out later.

Makes sense to me...

Thx.
Michael Kelley (LINUX) Feb. 7, 2023, 7:01 p.m. UTC | #6
From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 4:41 AM
> 
> On Thu, Feb 02, 2023 at 05:49:44AM +0000, Michael Kelley (LINUX) wrote:
> > I could do:
> > 1.  CC_ATTR_PARAVISOR_SPLIT_ADDRESS_SPACE, which is similar to
> >     what I had for v1 & v2.   At the time, somebody commented that
> >     this might be a bit too general.
> > 2.  Keep CC_ATTR_ACCESS_IOAPIC_ENCRYPTED and add
> >     CC_ATTR_ACCESS_TPM_ENCRYPTED, which would decouple them
> > 3.  CC_ATTR_ACCESS_IOAPIC_AND_TPM_ENCRYPTED, which is very
> >     narrow and specific.
> >
> > I have weak preference for #1 above, but I could go with any of them.
> > What's your preference?
> 
> Either 1. but a shorter name or something which works with the TDX side
> too.

Unless there are objections, I'll go with CC_ATTR_PARAVISOR_DEVICES,
which is shorter.  The full details of the meaning will be in a comment
where this is defined with all the other CC_ATTR_* values.

> 
> Or are there no similar TDX solutions planned where the guest runs
> unmodified and under a paravisor?

The TDX plans are still being sorted out.  But if we end up with such
an approach, CC_ATTR_PARAVISOR_DEVICES will be correct for TDX
also.

Michael

> 
> > For v6 of the patch series, I've coded devm_ioremap_resource_enc() to call
> > __devm_ioremap(), which then calls ioremap_encrypted().  I've updated the
> > TPM driver to use cc_platform_has() with whatever attribute name we agree
> > on to decide between devm_ioremap_resource_enc() and
> > devm_ioremap_resource().
> >
> > If this approach is OK with the TPM driver maintainers, I'm good with it.
> > More robust handling of a mix of encrypted and decrypted devices can get
> > sorted out later.
> 
> Makes sense to me...
> 
> Thx.
Borislav Petkov Feb. 7, 2023, 7:33 p.m. UTC | #7
On Tue, Feb 07, 2023 at 07:01:25PM +0000, Michael Kelley (LINUX) wrote:
> Unless there are objections, I'll go with CC_ATTR_PARAVISOR_DEVICES,

What does "DEVICES" mean in this context?

You need to think about !virt people too who are already confused by the
word "paravisor". :-)
Michael Kelley (LINUX) Feb. 7, 2023, 7:48 p.m. UTC | #8
From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 11:33 AM
> 
> On Tue, Feb 07, 2023 at 07:01:25PM +0000, Michael Kelley (LINUX) wrote:
> > Unless there are objections, I'll go with CC_ATTR_PARAVISOR_DEVICES,
> 
> What does "DEVICES" mean in this context?
> 
> You need to think about !virt people too who are already confused by the
> word "paravisor". :-)
> 

Maybe I misunderstood your previous comment about "Either 1".   We can
avoid "PARAVISOR" entirely by going with two attributes:

CC_ATTR_ACCESS_IOAPIC_ENCRYPTED
CC_ATTR_ACCESS_TPM_ENCRYPTED

These are much more specific, and relatively short, and having two allows
decoupling the handling of the IO-APIC and TPM.  Combining into the single

CC_ATTR_ACCESS_IOAPIC_AND_TPM_ENCRYPTED

also works but is longer.

Capturing the full meaning in the string names is probably impossible.
Referring to the comment for the definition will be required for a full
understanding.

Michael
Borislav Petkov Feb. 7, 2023, 7:54 p.m. UTC | #9
On Tue, Feb 07, 2023 at 07:48:06PM +0000, Michael Kelley (LINUX) wrote:
> From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 11:33 AM
> > 
> > On Tue, Feb 07, 2023 at 07:01:25PM +0000, Michael Kelley (LINUX) wrote:
> > > Unless there are objections, I'll go with CC_ATTR_PARAVISOR_DEVICES,
> > 
> > What does "DEVICES" mean in this context?
> > 
> > You need to think about !virt people too who are already confused by the
> > word "paravisor". :-)
> > 
> 
> Maybe I misunderstood your previous comment about "Either 1".   We can
> avoid "PARAVISOR" entirely by going with two attributes:

No, I'm fine with CC_ATTR_PARAVISOR. Why would you have to have
CC_ATTR_PARAVISOR_DEVICES? I.e., the string "_DEVICES" appended after
"PARAVISOR". Isn't CC_ATTR_PARAVISOR enough?
Michael Kelley (LINUX) Feb. 7, 2023, 7:57 p.m. UTC | #10
From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 11:55 AM
> 
> On Tue, Feb 07, 2023 at 07:48:06PM +0000, Michael Kelley (LINUX) wrote:
> > From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 11:33 AM
> > >
> > > On Tue, Feb 07, 2023 at 07:01:25PM +0000, Michael Kelley (LINUX) wrote:
> > > > Unless there are objections, I'll go with CC_ATTR_PARAVISOR_DEVICES,
> > >
> > > What does "DEVICES" mean in this context?
> > >
> > > You need to think about !virt people too who are already confused by the
> > > word "paravisor". :-)
> > >
> >
> > Maybe I misunderstood your previous comment about "Either 1".   We can
> > avoid "PARAVISOR" entirely by going with two attributes:
> 
> No, I'm fine with CC_ATTR_PARAVISOR. Why would you have to have
> CC_ATTR_PARAVISOR_DEVICES? I.e., the string "_DEVICES" appended after
> "PARAVISOR". Isn't CC_ATTR_PARAVISOR enough?
> 

Works for me. :-)

Michael
Michael Kelley (LINUX) Feb. 8, 2023, 12:18 a.m. UTC | #11
From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 11:55 AM
> 
> On Tue, Feb 07, 2023 at 07:48:06PM +0000, Michael Kelley (LINUX) wrote:
> > From: Borislav Petkov <bp@alien8.de> Sent: Tuesday, February 7, 2023 11:33 AM
> > >
> > > On Tue, Feb 07, 2023 at 07:01:25PM +0000, Michael Kelley (LINUX) wrote:
> > > > Unless there are objections, I'll go with CC_ATTR_PARAVISOR_DEVICES,
> > >
> > > What does "DEVICES" mean in this context?
> > >
> > > You need to think about !virt people too who are already confused by the
> > > word "paravisor". :-)
> > >
> >
> > Maybe I misunderstood your previous comment about "Either 1".   We can
> > avoid "PARAVISOR" entirely by going with two attributes:
> 
> No, I'm fine with CC_ATTR_PARAVISOR. Why would you have to have
> CC_ATTR_PARAVISOR_DEVICES? I.e., the string "_DEVICES" appended after
> "PARAVISOR". Isn't CC_ATTR_PARAVISOR enough?
> 

Dave --

In v2 of this patch series, you had concerns about CC_ATTR_PARAVISOR being too
generic. [1]   After some back-and-forth discussion in this thread, Boris is back to
preferring it.   Can you live with CC_ATTR_PARAVISOR?  Just trying to reach
consensus ... 

Michael

[1]   https://lore.kernel.org/linux-hyperv/Y258BO8ohVtVZvSH@liuwe-devbox-debian-v2/T/#m593853d8094453ad3f1a5552dad995ccc6c019b2
Dave Hansen Feb. 8, 2023, 3:09 p.m. UTC | #12
On 2/7/23 16:18, Michael Kelley (LINUX) wrote:
> In v2 of this patch series, you had concerns about CC_ATTR_PARAVISOR being too
> generic. [1]   After some back-and-forth discussion in this thread, Boris is back to
> preferring it.   Can you live with CC_ATTR_PARAVISOR?  Just trying to reach
> consensus ... 

I still think it's too generic.  Even the comment was trying to be too
generic:

> +	/**
> +	 * @CC_ATTR_HAS_PARAVISOR: Guest VM is running with a paravisor
> +	 *
> +	 * The platform/OS is running as a guest/virtual machine with
> +	 * a paravisor in VMPL0. Having a paravisor affects things
> +	 * like whether the I/O APIC is emulated and operates in the
> +	 * encrypted or decrypted portion of the guest physical address
> +	 * space.
> +	 *
> +	 * Examples include Hyper-V SEV-SNP guests using vTOM.
> +	 */
> +	CC_ATTR_HAS_PARAVISOR,

This doesn't help me figure out when I should use CC_ATTR_HAS_PARAVISOR
really at all.  It "operates in the encrypted or decrypted portion..."
Which one is it?  Should I be adding or removing encryption on the
mappings for paravisors?

That's opposed to:

> +	/**
> +	 * @CC_ATTR_ACCESS_IOAPIC_ENCRYPTED: Guest VM IO-APIC is encrypted
> +	 *
> +	 * The platform/OS is running as a guest/virtual machine with
> +	 * an IO-APIC that is emulated by a paravisor running in the
> +	 * guest VM context. As such, the IO-APIC is accessed in the
> +	 * encrypted portion of the guest physical address space.
> +	 *
> +	 * Examples include Hyper-V SEV-SNP guests using vTOM.
> +	 */
> +	CC_ATTR_ACCESS_IOAPIC_ENCRYPTED,

Which makes this code almost stupidly obvious:

> -	flags = pgprot_decrypted(flags);
> +	if (!cc_platform_has(CC_ATTR_ACCESS_IOAPIC_ENCRYPTED))
> +		flags = pgprot_decrypted(flags);

"Oh, if it's access is not encrypted, then get the decrypted version of
the flags."

Compare that to:

	if (!cc_platform_has(CC_ATTR_PARAVISOR))
		flags = pgprot_decrypted(flags);

Which is a big fat WTF.  Because a paravisor "operates in the encrypted
or decrypted portion..."  So is this if() condition correct or inverted?
It's utterly impossible to tell because of how generic the option is.

The only way to make sense of the generic thing is to do:

	/* Paravisors have a decrypted IO-APIC mapping: */
	if (!cc_platform_has(CC_ATTR_PARAVISOR))
		flags = pgprot_decrypted(flags);

at every site to state the assumption and make the connection between
paravisors and the behavior.  If you want to go do _that_, then fine by
me.  But, at that point, the naming is pretty worthless because you
could also have said "goldfish" instead of "paravisor" and it makes an
equal amount of sense:

	/* Goldfish have a decrypted IO-APIC mapping: */
	if (!cc_platform_has(CC_ATTR_GOLDFISH))
		flags = pgprot_decrypted(flags);

I get it, naming is hard.
Dave Hansen Feb. 8, 2023, 5:23 p.m. UTC | #13
On 2/7/23 04:41, Borislav Petkov wrote:
> Or are there no similar TDX solutions planned where the guest runs
> unmodified and under a paravisor?

I actually don't think paravisors make *ANY* sense for Linux.  If you
have to modify the guest, then just modify it to talk to the hypervisor
directly.  This code is... modifying the guest.  What does putting a
paravisor in the middle do for you?

It might help with binary drivers, but we don't do upstream kernel work
to make silly binary Linux drivers happy.

So, no, there's no similar TDX solutions planned, at least for Linux
guests.  Unless I missed the memo.  Kirill?
Michael Kelley (LINUX) Feb. 9, 2023, 5:29 p.m. UTC | #14
From: Dave Hansen <dave.hansen@intel.com> Sent: Wednesday, February 8, 2023 7:10 AM
> 
> On 2/7/23 16:18, Michael Kelley (LINUX) wrote:
> > In v2 of this patch series, you had concerns about CC_ATTR_PARAVISOR being too
> > generic. [1]   After some back-and-forth discussion in this thread, Boris is back to
> > preferring it.   Can you live with CC_ATTR_PARAVISOR?  Just trying to reach
> > consensus ...
> 
> I still think it's too generic.  Even the comment was trying to be too
> generic:
> 
> > +	/**
> > +	 * @CC_ATTR_HAS_PARAVISOR: Guest VM is running with a paravisor
> > +	 *
> > +	 * The platform/OS is running as a guest/virtual machine with
> > +	 * a paravisor in VMPL0. Having a paravisor affects things
> > +	 * like whether the I/O APIC is emulated and operates in the
> > +	 * encrypted or decrypted portion of the guest physical address
> > +	 * space.
> > +	 *
> > +	 * Examples include Hyper-V SEV-SNP guests using vTOM.
> > +	 */
> > +	CC_ATTR_HAS_PARAVISOR,
> 
> This doesn't help me figure out when I should use CC_ATTR_HAS_PARAVISOR
> really at all.  It "operates in the encrypted or decrypted portion..."
> Which one is it?  Should I be adding or removing encryption on the
> mappings for paravisors?
> 
> That's opposed to:
> 
> > +	/**
> > +	 * @CC_ATTR_ACCESS_IOAPIC_ENCRYPTED: Guest VM IO-APIC is encrypted
> > +	 *
> > +	 * The platform/OS is running as a guest/virtual machine with
> > +	 * an IO-APIC that is emulated by a paravisor running in the
> > +	 * guest VM context. As such, the IO-APIC is accessed in the
> > +	 * encrypted portion of the guest physical address space.
> > +	 *
> > +	 * Examples include Hyper-V SEV-SNP guests using vTOM.
> > +	 */
> > +	CC_ATTR_ACCESS_IOAPIC_ENCRYPTED,
> 
> Which makes this code almost stupidly obvious:
> 
> > -	flags = pgprot_decrypted(flags);
> > +	if (!cc_platform_has(CC_ATTR_ACCESS_IOAPIC_ENCRYPTED))
> > +		flags = pgprot_decrypted(flags);
> 
> "Oh, if it's access is not encrypted, then get the decrypted version of
> the flags."
> 
> Compare that to:
> 
> 	if (!cc_platform_has(CC_ATTR_PARAVISOR))
> 		flags = pgprot_decrypted(flags);
> 
> Which is a big fat WTF.  Because a paravisor "operates in the encrypted
> or decrypted portion..."  So is this if() condition correct or inverted?
> It's utterly impossible to tell because of how generic the option is.
> 
> The only way to make sense of the generic thing is to do:
> 
> 	/* Paravisors have a decrypted IO-APIC mapping: */
> 	if (!cc_platform_has(CC_ATTR_PARAVISOR))
> 		flags = pgprot_decrypted(flags);
> 
> at every site to state the assumption and make the connection between
> paravisors and the behavior.  If you want to go do _that_, then fine by
> me.  But, at that point, the naming is pretty worthless because you
> could also have said "goldfish" instead of "paravisor" and it makes an
> equal amount of sense:
> 
> 	/* Goldfish have a decrypted IO-APIC mapping: */
> 	if (!cc_platform_has(CC_ATTR_GOLDFISH))
> 		flags = pgprot_decrypted(flags);
> 
> I get it, naming is hard.

Boris --

Any further comments?  Trying to reach consensus.  A
solution aligned with Dave's arguments would keep the current
CC_ATTR_ACCESS_IOAPIC_ENCRYPTED, and add
CC_ATTR_ACCESS_TPM_ENCRYPTED to cover the TPM case,
which decouples the two.

Yes, naming is hard.  Reaching consensus on naming is even
harder.  :-)

Michael
Michael Kelley (LINUX) Feb. 9, 2023, 5:47 p.m. UTC | #15
From: Dave Hansen <dave.hansen@intel.com> Sent: Wednesday, February 8, 2023 9:24 AM
> 
> On 2/7/23 04:41, Borislav Petkov wrote:
> > Or are there no similar TDX solutions planned where the guest runs
> > unmodified and under a paravisor?
> 
> I actually don't think paravisors make *ANY* sense for Linux.  If you
> have to modify the guest, then just modify it to talk to the hypervisor
> directly.  This code is... modifying the guest.  What does putting a
> paravisor in the middle do for you?

One of the original goals of the paravisor was to make fewer
modifications to the guest, especially in areas that aren't directly related
to the hypervisor.  It's arguable as to whether that goal played out in
reality.

But another significant goal is to be able to move some device emulation
from the hypervisor/VMM to the guest context.  In a CoCo VM, this move
is from outside the TCB to inside the TCB.  A great example is a virtual
TPM.  Per the CoCo VM threat model, a guest can't rely on a vTPM
provided by the host.  But a guest *can* rely on a vTPM implemented in
a more privileged layer of the guest context.  With CoCo VMs in the
Azure public cloud, the paravisor also provides other device emulation, like
the IO-APIC to solve some of the ugly interrupt delivery issues.  In a
complete solution, it should be possible for a customer to provide his
own paravisor, or at least to inspect/audit the vendor-provided paravisor
code so that it can be certified against whatever security standards the
customer requires.  For Azure CoCo VMs, this part is a work-in-progress.

This could turn into an extended discussion, and I've given only a
fairly high-level answer.   There are architects at Microsoft who could
probably give a better rendition of why we've pursued the paravisor
approach with SEV-SNP guests.

Michael

> 
> It might help with binary drivers, but we don't do upstream kernel work
> to make silly binary Linux drivers happy.
> 
> So, no, there's no similar TDX solutions planned, at least for Linux
> guests.  Unless I missed the memo.  Kirill?
Sean Christopherson Feb. 10, 2023, 6:41 p.m. UTC | #16
Wearing my KVM hat and not my Google hat...

On Thu, Feb 09, 2023, Michael Kelley (LINUX) wrote:
> From: Dave Hansen <dave.hansen@intel.com> Sent: Wednesday, February 8, 2023 9:24 AM
> > 
> > On 2/7/23 04:41, Borislav Petkov wrote:
> > > Or are there no similar TDX solutions planned where the guest runs
> > > unmodified and under a paravisor?
> > 
> > I actually don't think paravisors make *ANY* sense for Linux.

I 100% agree, but Intel made what I think almost entirely irrelevant by refusing
to allow third party code to run in SEAM.

> > If you have to modify the guest, then just modify it to talk to the
> > hypervisor directly.  This code is... modifying the guest.  What does
> > putting a paravisor in the middle do for you?
> 
> One of the original goals of the paravisor was to make fewer
> modifications to the guest, especially in areas that aren't directly related
> to the hypervisor.  It's arguable as to whether that goal played out in
> reality.
> 
> But another significant goal is to be able to move some device emulation
> from the hypervisor/VMM to the guest context.  In a CoCo VM, this move
> is from outside the TCB to inside the TCB.  A great example is a virtual
> TPM.  Per the CoCo VM threat model, a guest can't rely on a vTPM
> provided by the host.

I vehemently disagree with this assertion.  It's kinda sorta true, but only
because Intel and AMD have gone down the road of not providing the mechanisms and
ability for the hypervisor to run and attest to the integrity, functionality, etc.
of (a subset of) the hypervisor's own code.

Taking SEAM/TDX as an example, if the code running in SEAM were an extension of
KVM instead of a hypervisor-agnostic nanny, then there would be no need for a
"paravisor" to provide a vTPM.  It would be very feasible to teach the SEAM-protected
bits of KVM to forward vTPM accesses to a host-provided, signed, attested, and open
source software running in a helper TD.

I fully realize you meant "untrusted host", but statements like "the host can't
be trusted" subconciously reinforce the, IMO, flawed model of hardware vendors
and _only_ hardware vendors providing the trusted bits.

The idea that firmware/software written by hardware vendors is somehow more
trustworthy than fully open source software is simultaneously laughable and
infuriating.  

Anyways, tying things back to the actual code being discussed, I vote against
CC_ATTR_PARAVISOR.  Being able to trust device emulation is not unique to a
paravisor.  A single flag also makes too many assumptions about what is trusted
and thus should be accessed encrypted.
Dave Hansen Feb. 10, 2023, 6:58 p.m. UTC | #17
On 2/10/23 10:41, Sean Christopherson wrote:
> Anyways, tying things back to the actual code being discussed, I vote against
> CC_ATTR_PARAVISOR.  Being able to trust device emulation is not unique to a
> paravisor.  A single flag also makes too many assumptions about what is trusted
> and thus should be accessed encrypted.

Did you like the more wordy per-device flags better?  Or did you have
something else in mind entirely?
Borislav Petkov Feb. 10, 2023, 7:03 p.m. UTC | #18
On Fri, Feb 10, 2023 at 06:41:54PM +0000, Sean Christopherson wrote:
> Anyways, tying things back to the actual code being discussed, I vote against
> CC_ATTR_PARAVISOR.  Being able to trust device emulation is not unique to a
> paravisor.  A single flag also makes too many assumptions about what is trusted
> and thus should be accessed encrypted.

I'm not crazy about the alternative either: one flag per access type:
IO APIC, vTPM, and soon.

Soon this will become an unmaintainable zoo of different guest types
people want the kernel to support. I don't think I want that madness in
kernel proper.
Michael Kelley (LINUX) Feb. 10, 2023, 7:15 p.m. UTC | #19
From: Borislav Petkov <bp@alien8.de> Sent: Friday, February 10, 2023 11:04 AM
> 
> On Fri, Feb 10, 2023 at 06:41:54PM +0000, Sean Christopherson wrote:
> > Anyways, tying things back to the actual code being discussed, I vote against
> > CC_ATTR_PARAVISOR.  Being able to trust device emulation is not unique to a
> > paravisor.  A single flag also makes too many assumptions about what is trusted
> > and thus should be accessed encrypted.
> 
> I'm not crazy about the alternative either: one flag per access type:
> IO APIC, vTPM, and soon.
>

FWIW, I don't think the list of devices to be accessed encrypted is likely
to grow significantly.  Is one or two more possible?  Perhaps.  Does it
become a list of ten?  I doubt it.

One approach is to go with the individual device attributes for now.
If the list does grow significantly, there will probably be patterns
or groupings that we can't discern now.  We could restructure into
larger buckets at that point based on those patterns/groupings.

Michael

> 
> Soon this will become an unmaintainable zoo of different guest types
> people want the kernel to support. I don't think I want that madness in
> kernel proper.
>
Borislav Petkov Feb. 10, 2023, 7:36 p.m. UTC | #20
On Fri, Feb 10, 2023 at 07:15:41PM +0000, Michael Kelley (LINUX) wrote:
> FWIW, I don't think the list of devices to be accessed encrypted is likely
> to grow significantly.  Is one or two more possible?  Perhaps.  Does it
> become a list of ten?  I doubt it.

What happens if the next silly HV guest scheme comes along and they do
need more and different ones?

Do I say, but but, Michael said that he doubted at the time that that
list would grow... ;-\

And then all our paths are sprinkled with

	if (cc_platform_has())

and we can't change sh*t anymore out of fear that some weird guest type
will break.

> One approach is to go with the individual device attributes for now.
> If the list does grow significantly, there will probably be patterns
> or groupings that we can't discern now.  We could restructure into
> larger buckets at that point based on those patterns/groupings.

There's a reason the word "platform" is in cc_platform_has(). Initially
we wanted to distinguish attributes of the different platforms. So even
if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
platform and it *is* one platform.

So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
call it like the platform, not to mean "I need this functionality".

And yes, we could do the regroupings later because, yeah, those things
are not exposed to userspace so it's not like they're cast in stone but
I fear that we will do regroupings and we will break guests.

Now if you had CC_ATTR_<PLATFORM_TYPE> then you break (or not) only that
platform.

Oh, and then there's the thing that this is kernel proper - that code
still runs on real hardware, for now, and is not only guests. And not
everything is a damn cloud.

So I don't want a zoo here and we'd have to agree to distinguish by
platform and not by different functionality required.

Thx.
Dave Hansen Feb. 10, 2023, 7:58 p.m. UTC | #21
On 2/10/23 11:36, Borislav Petkov wrote:
>> One approach is to go with the individual device attributes for now.>> If the list does grow significantly, there will probably be patterns
>> or groupings that we can't discern now.  We could restructure into
>> larger buckets at that point based on those patterns/groupings.
> There's a reason the word "platform" is in cc_platform_has(). Initially
> we wanted to distinguish attributes of the different platforms. So even
> if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
> platform and it *is* one platform.
> 
> So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
> call it like the platform, not to mean "I need this functionality".

I can live with that.  There's already a CC_ATTR_GUEST_SEV_SNP, so it
would at least not be too much of a break from what we already have.
Sean Christopherson Feb. 10, 2023, 8:50 p.m. UTC | #22
On Fri, Feb 10, 2023, Dave Hansen wrote:
> On 2/10/23 11:36, Borislav Petkov wrote:
> >> One approach is to go with the individual device attributes for now.>> If the list does grow significantly, there will probably be patterns
> >> or groupings that we can't discern now.  We could restructure into
> >> larger buckets at that point based on those patterns/groupings.
> > There's a reason the word "platform" is in cc_platform_has(). Initially
> > we wanted to distinguish attributes of the different platforms. So even
> > if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
> > platform and it *is* one platform.
> > 
> > So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
> > call it like the platform, not to mean "I need this functionality".
> 
> I can live with that.  There's already a CC_ATTR_GUEST_SEV_SNP, so it
> would at least not be too much of a break from what we already have.

I'm fine with CC_ATTR_SEV_VTOM, assuming the proposal is to have something like:

	static inline bool is_address_range_private(resource_size_t addr)
	{
		if (cc_platform_has(CC_ATTR_SEV_VTOM))
			return is_address_below_vtom(addr);

		return false;
	}

i.e. not have SEV_VTOM mean "I/O APIC and vTPM are private".  Though I don't see
the point in making it SEV vTOM specific or using a flag.  Despite what any of us
think about TDX paravisors, it's completely doable within the confines of TDX to
have an emulated device reside in the private address space.  E.g. why not
something like this? 

	static inline bool is_address_range_private(resource_size_t addr)
	{
		return addr < cc_platform_private_end;
	}

where SEV fills in "cc_platform_private_end" when vTOM is enabled, and TDX does
the same.  Or wrap cc_platform_private_end in a helper, etc.
Sean Christopherson Feb. 10, 2023, 8:57 p.m. UTC | #23
On Fri, Feb 10, 2023, Sean Christopherson wrote:
> On Fri, Feb 10, 2023, Dave Hansen wrote:
> > On 2/10/23 11:36, Borislav Petkov wrote:
> > >> One approach is to go with the individual device attributes for now.>> If the list does grow significantly, there will probably be patterns
> > >> or groupings that we can't discern now.  We could restructure into
> > >> larger buckets at that point based on those patterns/groupings.
> > > There's a reason the word "platform" is in cc_platform_has(). Initially
> > > we wanted to distinguish attributes of the different platforms. So even
> > > if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
> > > platform and it *is* one platform.
> > > 
> > > So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
> > > call it like the platform, not to mean "I need this functionality".
> > 
> > I can live with that.  There's already a CC_ATTR_GUEST_SEV_SNP, so it
> > would at least not be too much of a break from what we already have.
> 
> I'm fine with CC_ATTR_SEV_VTOM, assuming the proposal is to have something like:
> 
> 	static inline bool is_address_range_private(resource_size_t addr)
> 	{
> 		if (cc_platform_has(CC_ATTR_SEV_VTOM))
> 			return is_address_below_vtom(addr);
> 
> 		return false;
> 	}
> 
> i.e. not have SEV_VTOM mean "I/O APIC and vTPM are private".  Though I don't see
> the point in making it SEV vTOM specific or using a flag.  Despite what any of us
> think about TDX paravisors, it's completely doable within the confines of TDX to
> have an emulated device reside in the private address space.  E.g. why not
> something like this? 
> 
> 	static inline bool is_address_range_private(resource_size_t addr)
> 	{
> 		return addr < cc_platform_private_end;
> 	}
> 
> where SEV fills in "cc_platform_private_end" when vTOM is enabled, and TDX does
> the same.  Or wrap cc_platform_private_end in a helper, etc.

Gah, forgot that the intent with TDX is to enumerate devices in their legacy
address spaces.  So a TDX guest couldn't do this by default, but if/when Hyper-V
or some other hypervisor moves I/O APIC, vTPM, etc... into the TCB, the common
code would just work and only the hypervisor-specific paravirt code would need
to change.

Probably need a more specific name than is_address_range_private() though, e.g.
is_mmio_address_range_private()?
Michael Kelley (LINUX) Feb. 10, 2023, 9:27 p.m. UTC | #24
From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 10, 2023 12:58 PM
> 
> On Fri, Feb 10, 2023, Sean Christopherson wrote:
> > On Fri, Feb 10, 2023, Dave Hansen wrote:
> > > On 2/10/23 11:36, Borislav Petkov wrote:
> > > >> One approach is to go with the individual device attributes for now.>> If the list
> does grow significantly, there will probably be patterns
> > > >> or groupings that we can't discern now.  We could restructure into
> > > >> larger buckets at that point based on those patterns/groupings.
> > > > There's a reason the word "platform" is in cc_platform_has(). Initially
> > > > we wanted to distinguish attributes of the different platforms. So even
> > > > if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
> > > > platform and it *is* one platform.
> > > >
> > > > So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
> > > > call it like the platform, not to mean "I need this functionality".
> > >
> > > I can live with that.  There's already a CC_ATTR_GUEST_SEV_SNP, so it
> > > would at least not be too much of a break from what we already have.
> >
> > I'm fine with CC_ATTR_SEV_VTOM, assuming the proposal is to have something like:
> >
> > 	static inline bool is_address_range_private(resource_size_t addr)
> > 	{
> > 		if (cc_platform_has(CC_ATTR_SEV_VTOM))
> > 			return is_address_below_vtom(addr);
> >
> > 		return false;
> > 	}
> >
> > i.e. not have SEV_VTOM mean "I/O APIC and vTPM are private".  Though I don't see
> > the point in making it SEV vTOM specific or using a flag.  Despite what any of us
> > think about TDX paravisors, it's completely doable within the confines of TDX to
> > have an emulated device reside in the private address space.  E.g. why not
> > something like this?
> >
> > 	static inline bool is_address_range_private(resource_size_t addr)
> > 	{
> > 		return addr < cc_platform_private_end;
> > 	}
> >
> > where SEV fills in "cc_platform_private_end" when vTOM is enabled, and TDX does
> > the same.  Or wrap cc_platform_private_end in a helper, etc.
> 
> Gah, forgot that the intent with TDX is to enumerate devices in their legacy
> address spaces.  So a TDX guest couldn't do this by default, but if/when Hyper-V
> or some other hypervisor moves I/O APIC, vTPM, etc... into the TCB, the common
> code would just work and only the hypervisor-specific paravirt code would need
> to change.
> 
> Probably need a more specific name than is_address_range_private() though, e.g.
> is_mmio_address_range_private()?

Maybe I'm not understanding what you are proposing, but in an SEV-SNP
VM using vTOM, devices like the IO-APIC and TPM live at their usual guest
physical addresses.  The question is whether the kernel virtual mapping
should be set up with encryption enabled or disabled.   That question can't
be answered by looking at the device's address.  Whether to map a particular
device with encryption enabled really is a property of the "platform" because
it depends on whether the paravisor is emulating the device.  Having the
paravisor emulate the device does not change its guest physical address.

While there's a duality, it's better to think of the vTOM bit as the
"encryption" flag in the PTE rather than part of the guest physical address.
A key part of this patch series is about making that shift in how the vTOM
bit is treated.  With the change, the vTOM bit is treated pretty much the
same as the TDX SHARED flag.

Michael
Sean Christopherson Feb. 10, 2023, 11:47 p.m. UTC | #25
On Fri, Feb 10, 2023, Michael Kelley (LINUX) wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 10, 2023 12:58 PM
> > 
> > On Fri, Feb 10, 2023, Sean Christopherson wrote:
> > > On Fri, Feb 10, 2023, Dave Hansen wrote:
> > > > On 2/10/23 11:36, Borislav Petkov wrote:
> > > > >> One approach is to go with the individual device attributes for now.>> If the list
> > does grow significantly, there will probably be patterns
> > > > >> or groupings that we can't discern now.  We could restructure into
> > > > >> larger buckets at that point based on those patterns/groupings.
> > > > > There's a reason the word "platform" is in cc_platform_has(). Initially
> > > > > we wanted to distinguish attributes of the different platforms. So even
> > > > > if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
> > > > > platform and it *is* one platform.
> > > > >
> > > > > So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
> > > > > call it like the platform, not to mean "I need this functionality".
> > > >
> > > > I can live with that.  There's already a CC_ATTR_GUEST_SEV_SNP, so it
> > > > would at least not be too much of a break from what we already have.
> > >
> > > I'm fine with CC_ATTR_SEV_VTOM, assuming the proposal is to have something like:
> > >
> > > 	static inline bool is_address_range_private(resource_size_t addr)
> > > 	{
> > > 		if (cc_platform_has(CC_ATTR_SEV_VTOM))
> > > 			return is_address_below_vtom(addr);
> > >
> > > 		return false;
> > > 	}
> > >
> > > i.e. not have SEV_VTOM mean "I/O APIC and vTPM are private".  Though I don't see
> > > the point in making it SEV vTOM specific or using a flag.  Despite what any of us
> > > think about TDX paravisors, it's completely doable within the confines of TDX to
> > > have an emulated device reside in the private address space.  E.g. why not
> > > something like this?
> > >
> > > 	static inline bool is_address_range_private(resource_size_t addr)
> > > 	{
> > > 		return addr < cc_platform_private_end;
> > > 	}
> > >
> > > where SEV fills in "cc_platform_private_end" when vTOM is enabled, and TDX does
> > > the same.  Or wrap cc_platform_private_end in a helper, etc.
> > 
> > Gah, forgot that the intent with TDX is to enumerate devices in their legacy
> > address spaces.  So a TDX guest couldn't do this by default, but if/when Hyper-V
> > or some other hypervisor moves I/O APIC, vTPM, etc... into the TCB, the common
> > code would just work and only the hypervisor-specific paravirt code would need
> > to change.
> > 
> > Probably need a more specific name than is_address_range_private() though, e.g.
> > is_mmio_address_range_private()?
> 
> Maybe I'm not understanding what you are proposing, but in an SEV-SNP
> VM using vTOM, devices like the IO-APIC and TPM live at their usual guest
> physical addresses.

Ah, so as the cover letter says, the intent really is to treat vTOM as an
attribute bit.  Sorry, I got confused by Boris's comment:

  : What happens if the next silly HV guest scheme comes along and they do
  : need more and different ones?

Based on that comment, I assumed the proposal to use CC_ATTR_SEV_VTOM was intended
to be a generic range-based thing, but it sounds like that's not the case. 

IMO, using CC_ATTR_SEV_VTOM to infer anything about the state of I/O APIC or vTPM
is wrong.  vTOM as a platform feature effectively says "physical address bit X
controls private vs. shared" (ignoring weird usage of vTOM).  vTOM does not mean
I/O APIC and vTPM are private, that's very much a property of Hyper-V's current
generation of vTOM-based VMs.

Hardcoding this in the guest feels wrong.  Ideally, we would have a way to enumerate
that a device is private/trusted, e.g. through ACPI.  I'm guessing we already
missed the boat on that, so the next best thing is to do something like Michael
originally proposed in this patch and shove the "which devices are private" logic
into hypervisor-specific code, i.e. let Hyper-V figure out how to enumerate to its
guests which devices are shared.

I agree with Boris' comment that a one-off "other encrypted range" is a hack, but
that's just an API problem.  The kernel already has hypervisor specific hooks (and
for SEV-ES even), why not expand that?  That way figuring out which devices are
private is wholly contained in Hyper-V code, at least until there's a generic
solution for enumerating private devices, though that seems unlikely to happen
and will be a happy problem to solve if it does come about.

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index a868b76cd3d4..08f65ed439d9 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2682,11 +2682,16 @@ static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
 {
        pgprot_t flags = FIXMAP_PAGE_NOCACHE;
 
-       /*
-        * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
-        * bits, just like normal ioremap():
-        */
-       flags = pgprot_decrypted(flags);
+       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+               /*
+               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
+               * bits, just like normal ioremap():
+               */
+               if (x86_platform.hyper.is_private_mmio(phys))
+                       flags = pgprot_encrypted(flags);
+               else
+                       flags = pgprot_decrypted(flags);
+       }
 
        __set_fixmap(idx, phys, flags);
 }
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 6453fbaedb08..0baec766b921 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -116,6 +116,9 @@ static void __ioremap_check_other(resource_size_t addr, struct ioremap_desc *des
        if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
                return;
 
+       if (x86_platform.hyper.is_private_mmio(addr))
+               desc->flags |= IORES_MAP_ENCRYPTED;
+
        if (!IS_ENABLED(CONFIG_EFI))
                return;
Michael Kelley (LINUX) Feb. 14, 2023, 7:45 a.m. UTC | #26
From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 10, 2023 3:47 PM
> 
> On Fri, Feb 10, 2023, Michael Kelley (LINUX) wrote:
> > From: Sean Christopherson <seanjc@google.com> Sent: Friday, February 10, 2023
> 12:58 PM
> > >
> > > On Fri, Feb 10, 2023, Sean Christopherson wrote:
> > > > On Fri, Feb 10, 2023, Dave Hansen wrote:
> > > > > On 2/10/23 11:36, Borislav Petkov wrote:
> > > > > >> One approach is to go with the individual device attributes for now.>> If the list
> > > does grow significantly, there will probably be patterns
> > > > > >> or groupings that we can't discern now.  We could restructure into
> > > > > >> larger buckets at that point based on those patterns/groupings.
> > > > > > There's a reason the word "platform" is in cc_platform_has(). Initially
> > > > > > we wanted to distinguish attributes of the different platforms. So even
> > > > > > if y'all don't like CC_ATTR_PARAVISOR, that is what distinguishes this
> > > > > > platform and it *is* one platform.
> > > > > >
> > > > > > So call it CC_ATTR_SEV_VTOM as it uses that technology or whatever. But
> > > > > > call it like the platform, not to mean "I need this functionality".
> > > > >
> > > > > I can live with that.  There's already a CC_ATTR_GUEST_SEV_SNP, so it
> > > > > would at least not be too much of a break from what we already have.
> > > >
> > > > I'm fine with CC_ATTR_SEV_VTOM, assuming the proposal is to have something
> like:
> > > >
> > > > 	static inline bool is_address_range_private(resource_size_t addr)
> > > > 	{
> > > > 		if (cc_platform_has(CC_ATTR_SEV_VTOM))
> > > > 			return is_address_below_vtom(addr);
> > > >
> > > > 		return false;
> > > > 	}
> > > >
> > > > i.e. not have SEV_VTOM mean "I/O APIC and vTPM are private".  Though I don't
> see
> > > > the point in making it SEV vTOM specific or using a flag.  Despite what any of us
> > > > think about TDX paravisors, it's completely doable within the confines of TDX to
> > > > have an emulated device reside in the private address space.  E.g. why not
> > > > something like this?
> > > >
> > > > 	static inline bool is_address_range_private(resource_size_t addr)
> > > > 	{
> > > > 		return addr < cc_platform_private_end;
> > > > 	}
> > > >
> > > > where SEV fills in "cc_platform_private_end" when vTOM is enabled, and TDX does
> > > > the same.  Or wrap cc_platform_private_end in a helper, etc.
> > >
> > > Gah, forgot that the intent with TDX is to enumerate devices in their legacy
> > > address spaces.  So a TDX guest couldn't do this by default, but if/when Hyper-V
> > > or some other hypervisor moves I/O APIC, vTPM, etc... into the TCB, the common
> > > code would just work and only the hypervisor-specific paravirt code would need
> > > to change.
> > >
> > > Probably need a more specific name than is_address_range_private() though, e.g.
> > > is_mmio_address_range_private()?
> >
> > Maybe I'm not understanding what you are proposing, but in an SEV-SNP
> > VM using vTOM, devices like the IO-APIC and TPM live at their usual guest
> > physical addresses.
> 
> Ah, so as the cover letter says, the intent really is to treat vTOM as an
> attribute bit.  Sorry, I got confused by Boris's comment:
> 
>   : What happens if the next silly HV guest scheme comes along and they do
>   : need more and different ones?
> 
> Based on that comment, I assumed the proposal to use CC_ATTR_SEV_VTOM was
> intended
> to be a generic range-based thing, but it sounds like that's not the case.
> 
> IMO, using CC_ATTR_SEV_VTOM to infer anything about the state of I/O APIC or vTPM
> is wrong.  vTOM as a platform feature effectively says "physical address bit X
> controls private vs. shared" (ignoring weird usage of vTOM).  vTOM does not mean
> I/O APIC and vTPM are private, that's very much a property of Hyper-V's current
> generation of vTOM-based VMs.
> 
> Hardcoding this in the guest feels wrong.  Ideally, we would have a way to enumerate
> that a device is private/trusted, e.g. through ACPI.  I'm guessing we already
> missed the boat on that, so the next best thing is to do something like Michael
> originally proposed in this patch and shove the "which devices are private" logic
> into hypervisor-specific code, i.e. let Hyper-V figure out how to enumerate to its
> guests which devices are shared.
> 
> I agree with Boris' comment that a one-off "other encrypted range" is a hack, but
> that's just an API problem.  The kernel already has hypervisor specific hooks (and
> for SEV-ES even), why not expand that?  That way figuring out which devices are
> private is wholly contained in Hyper-V code, at least until there's a generic
> solution for enumerating private devices, though that seems unlikely to happen
> and will be a happy problem to solve if it does come about.

Yes, this is definitely a cleaner way to implement what I was originally proposing.
I can make it work if there's agreement to take this approach.

Michael 

> 
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index a868b76cd3d4..08f65ed439d9 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2682,11 +2682,16 @@ static void io_apic_set_fixmap(enum fixed_addresses idx,
> phys_addr_t phys)
>  {
>         pgprot_t flags = FIXMAP_PAGE_NOCACHE;
> 
> -       /*
> -        * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> -        * bits, just like normal ioremap():
> -        */
> -       flags = pgprot_decrypted(flags);
> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> +               /*
> +               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> +               * bits, just like normal ioremap():
> +               */
> +               if (x86_platform.hyper.is_private_mmio(phys))
> +                       flags = pgprot_encrypted(flags);
> +               else
> +                       flags = pgprot_decrypted(flags);
> +       }
> 
>         __set_fixmap(idx, phys, flags);
>  }
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 6453fbaedb08..0baec766b921 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -116,6 +116,9 @@ static void __ioremap_check_other(resource_size_t addr, struct
> ioremap_desc *des
>         if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>                 return;
> 
> +       if (x86_platform.hyper.is_private_mmio(addr))
> +               desc->flags |= IORES_MAP_ENCRYPTED;
> +
>         if (!IS_ENABLED(CONFIG_EFI))
>                 return;
>
Borislav Petkov Feb. 16, 2023, 1:32 p.m. UTC | #27
On Fri, Feb 10, 2023 at 11:47:27PM +0000, Sean Christopherson wrote:
> I agree with Boris' comment that a one-off "other encrypted range" is a hack, but
> that's just an API problem.  The kernel already has hypervisor specific hooks (and
> for SEV-ES even), why not expand that?  That way figuring out which devices are
> private is wholly contained in Hyper-V code, at least until there's a generic
> solution for enumerating private devices, though that seems unlikely to happen
> and will be a happy problem to solve if it does come about.

I feel ya and this all makes sense and your proposals look clean enough
to me but we still need some way of determining whether this is a vTOM
on hyperv because there's the next crapola with

https://lore.kernel.org/r/20230209072220.6836-4-jgross@suse.com

because apparently hyperv does PAT but disables MTRRs for such vTOM
SEV-SNP guests and ... madness.

But that's not the only example - Xen has been doing this thing too.

And Jürgen has been trying to address this in a clean way but it is
a pain.

What I don't want to have is a gazillion ways to check what needs to
happen for which guest type. Because people who change the kernel to run
on baremetal, will break them. And I can't blame them. We try to support
all kinds of guests in the x86 code but this support should be plain and
simple.

Thx.
Michael Kelley (LINUX) Feb. 16, 2023, 4:16 p.m. UTC | #28
From: Borislav Petkov <bp@alien8.de> Sent: Thursday, February 16, 2023 5:33 AM
> 
> On Fri, Feb 10, 2023 at 11:47:27PM +0000, Sean Christopherson wrote:
> > I agree with Boris' comment that a one-off "other encrypted range" is a hack, but
> > that's just an API problem.  The kernel already has hypervisor specific hooks (and
> > for SEV-ES even), why not expand that?  That way figuring out which devices are
> > private is wholly contained in Hyper-V code, at least until there's a generic
> > solution for enumerating private devices, though that seems unlikely to happen
> > and will be a happy problem to solve if it does come about.
> 
> I feel ya and this all makes sense and your proposals look clean enough
> to me but we still need some way of determining whether this is a vTOM
> on hyperv 

Historically, callbacks like Sean proposed default to NULL and do nothing
unless they are explicitly set.  The Hyper-V vTOM code would set the callback.
Is that not sufficient?  Or in the two places where the callback would
be made, do you want to bracket with a test for being in a Hyper-V vTOM
VM?  If so, then we're back to needing something like CC_ATTR_PARAVISOR
on which to gate the callbacks.

Or do you mean something else entirely?

Michael

> because there's the next crapola with
> 
> https://lore.kernel.org/all/20230209072220.6836-4-jgross@suse.com/
> 
> because apparently hyperv does PAT but disables MTRRs for such vTOM
> SEV-SNP guests and ... madness.
> 
> But that's not the only example - Xen has been doing this thing too.
> 
> And Jürgen has been trying to address this in a clean way but it is
> a pain.
> 
> What I don't want to have is a gazillion ways to check what needs to
> happen for which guest type. Because people who change the kernel to run
> on baremetal, will break them. And I can't blame them. We try to support
> all kinds of guests in the x86 code but this support should be plain and
> simple.
>
Borislav Petkov Feb. 16, 2023, 5:06 p.m. UTC | #29
On Thu, Feb 16, 2023 at 04:16:16PM +0000, Michael Kelley (LINUX) wrote:
> Historically, callbacks like Sean proposed default to NULL and do nothing
> unless they are explicitly set.  The Hyper-V vTOM code would set the callback.
> Is that not sufficient?  Or in the two places where the callback would
> be made, do you want to bracket with a test for being in a Hyper-V vTOM
> VM?  If so, then we're back to needing something like CC_ATTR_PARAVISOR
> on which to gate the callbacks.
> 
> Or do you mean something else entirely?

See the second part of my reply.

This thing...

> > because there's the next crapola with
> > 
> > https://lore.kernel.org/all/20230209072220.6836-4-jgross@suse.com/
> > 
> > because apparently hyperv does PAT but disables MTRRs for such vTOM
> > SEV-SNP guests and ... madness.
> > 
> > But that's not the only example - Xen has been doing this thing too.
> > 
> > And Jürgen has been trying to address this in a clean way but it is
> > a pain.
> > 
> > What I don't want to have is a gazillion ways to check what needs to
> > happen for which guest type. Because people who change the kernel to run
> > on baremetal, will break them. And I can't blame them. We try to support
> > all kinds of guests in the x86 code but this support should be plain and
> > simple.

... here.

We need a single way to test for this guest type and stick with it.

I'd like for all guest types we support to be queried in a plain and
simple way.

Not:

* CC_ATTR_GUEST_MEM_ENCRYPT

* x86_platform.hyper.is_private_mmio(addr)

* CC_ATTR_PARAVISOR

to mean three different aspects of SEV-SNP guests using vTOM on Hyper-V.

This is going to be a major mess which we won't support.

Thx.
Michael Kelley (LINUX) Feb. 17, 2023, 6:16 a.m. UTC | #30
From: Borislav Petkov <bp@alien8.de> Sent: Thursday, February 16, 2023 9:07 AM
> 
> ... here.
> 
> We need a single way to test for this guest type and stick with it.
> 
> I'd like for all guest types we support to be queried in a plain and
> simple way.
> 
> Not:
> 
> * CC_ATTR_GUEST_MEM_ENCRYPT
> 
> * x86_platform.hyper.is_private_mmio(addr)
> 
> * CC_ATTR_PARAVISOR
> 
> to mean three different aspects of SEV-SNP guests using vTOM on Hyper-V.
> 
> This is going to be a major mess which we won't support.
> 

OK, I'm trying to figure out where to go next.  I've been following the pattern
set by the SEV/SEV-ES/SEV-SNP and TDX platforms in the cc_platform_has()
function.   Each platform returns "true" for multiple CC_ATTR_* values,
and those CC_ATTR_* values are tested in multiple places throughout
kernel code.  Some CC_ATTR_* values are shared by multiple platforms
(like CC_ATTR_GUEST_MEM_ENCRYPT) and some are unique to a particular
platform (like CC_ATTR_HOTPLUG_DISABLED).  For the CC_ATTR_* values
that are shared, the logic of which platforms they apply to occurs once in
cc_platform_has() rather than scattered and duplicated throughout the
kernel, which makes sense.  Any given platform is not represented by a
single CC_ATTR_* value, but by multiple ones.  Each CC_ATTR_* value 
essentially represents a chunk of kernel functionality that one or more
platforms need, and the platform indicates its need by cc_platform_has()
returning "true" for that value.

So I've been trying to apply the same pattern to the SNP vTOM sub-case
of SEV-SNP.   Is that consistent with your thinking, or is the whole
cc_platform_has() approach problematic, including for the existing
SEV flavors and for TDX?

Michael
Borislav Petkov Feb. 17, 2023, 2:55 p.m. UTC | #31
On Fri, Feb 17, 2023 at 06:16:56AM +0000, Michael Kelley (LINUX) wrote:
> Is that consistent with your thinking, or is the whole
> cc_platform_has() approach problematic, including for the existing SEV
> flavors and for TDX?

The confidential computing attributes are, yes, features. I've been
preaching since the very beginning that vTOM *is* *also* one such
feature. It is a feature bit in sev_features, for chrissakes. So by that
logic, those SEV-SNP HyperV guests should return true when

	cc_platform_has(CC_ATTR_GUEST_SEV_SNP_VTOM);

is tested.

But Sean doesn't like that.

If the access method to the IO-APIC and vTPM are specific to the
HyperV's vTOM implementation, then I don't mind if this were called

	cc_platform_has(CC_ATTR_GUEST_HYPERV_VTOM);

Frankly, I don't see any other enlightened guest using vTOM except
HyperV's but virt folks have managed to surprise me in the past too.

In any case, a single flag which is specific to that guest type is fine
too.

It feels like we're running in circles by now... ;-\
Sean Christopherson Feb. 22, 2023, 10:13 p.m. UTC | #32
On Fri, Feb 17, 2023, Borislav Petkov wrote:
> On Fri, Feb 17, 2023 at 06:16:56AM +0000, Michael Kelley (LINUX) wrote:
> > Is that consistent with your thinking, or is the whole
> > cc_platform_has() approach problematic, including for the existing SEV
> > flavors and for TDX?
> 
> The confidential computing attributes are, yes, features. I've been
> preaching since the very beginning that vTOM *is* *also* one such
> feature. It is a feature bit in sev_features, for chrissakes. So by that
> logic, those SEV-SNP HyperV guests should return true when
> 
> 	cc_platform_has(CC_ATTR_GUEST_SEV_SNP_VTOM);
> 
> is tested.
> 
> But Sean doesn't like that.

Because vTOM is a hardware feature, whereas the IO-APIC and vTPM being accessible
via private memory are software features.  It's very possible to emulate the
IO-APIC in trusted code without vTOM.

> If the access method to the IO-APIC and vTPM are specific to the
> HyperV's vTOM implementation, then I don't mind if this were called
> 
> 	cc_platform_has(CC_ATTR_GUEST_HYPERV_VTOM);

I still think that's likely to caused problems in the future, e.g. if Hyper-V
moves more stuff into the paravisor or if Hyper-V ends up with similar functionality
for TDX.  But it's not a sticking point, the only thing I'm fiercely resistant to
is conflating hardware features with software features.
Borislav Petkov Feb. 22, 2023, 10:33 p.m. UTC | #33
On Wed, Feb 22, 2023 at 02:13:44PM -0800, Sean Christopherson wrote:
> Because vTOM is a hardware feature, whereas the IO-APIC and vTPM being accessible
> via private memory are software features.  It's very possible to emulate the
> IO-APIC in trusted code without vTOM.

I know, but their use case is dictated by the fact that they're using
a SNP guest *with* vTOM as a SEV feature. And so their guest does
IO-APIC and vTPM *with* the vTOM SEV feature. That's what I'm trying to
model.

> > If the access method to the IO-APIC and vTPM are specific to the
> > HyperV's vTOM implementation, then I don't mind if this were called
> > 
> > 	cc_platform_has(CC_ATTR_GUEST_HYPERV_VTOM);
> 
> I still think that's likely to caused problems in the future, e.g. if Hyper-V
> moves more stuff into the paravisor or if Hyper-V ends up with similar functionality
> for TDX.

Yah, reportedly, TDX folks are not very interested in this case.

> But it's not a sticking point, the only thing I'm fiercely resistant to
> is conflating hardware features with software features.

So you and I need to find a common ground...
Sean Christopherson Feb. 22, 2023, 10:54 p.m. UTC | #34
On Wed, Feb 22, 2023, Borislav Petkov wrote:
> On Wed, Feb 22, 2023 at 02:13:44PM -0800, Sean Christopherson wrote:
> > Because vTOM is a hardware feature, whereas the IO-APIC and vTPM being accessible
> > via private memory are software features.  It's very possible to emulate the
> > IO-APIC in trusted code without vTOM.
> 
> I know, but their use case is dictated by the fact that they're using
> a SNP guest *with* vTOM as a SEV feature. And so their guest does
> IO-APIC and vTPM *with* the vTOM SEV feature. That's what I'm trying to
> model.

Why?  I genuinely don't understand the motivation for bundling all of this stuff
under a single "feature".  To me, that's like saying Haswell or Zen2 is a "feature",
but outside of a very few cases where the exact uarch truly matters, nothing pivots
on FMS because the CPU type is not a single feature.
Borislav Petkov Feb. 22, 2023, 11:34 p.m. UTC | #35
On Wed, Feb 22, 2023 at 02:54:47PM -0800, Sean Christopherson wrote:
> Why?  I genuinely don't understand the motivation for bundling all of this stuff
> under a single "feature".

It is called "sanity".

See here:

https://lore.kernel.org/r/Y%2B5immKTXCsjSysx@zn.tnic

We support SEV, SEV-ES, SEV-SNP, TDX, HyperV... guests and whatever's
coming down the pipe. And all that goes into arch/x86/ kernel proper
code.

The CC_ATTR stuff is clean-ish in the sense that we have separation by
confidential computing platform - AMD's and Intel's. Hyper-V comes along
and wants to define a different subset of that. And that's only the
SEV-SNP side - there's a TDX patchset too.

And then some other hypervisor will come along and say, but but, I wanna
have X and Y and a pink pony too.

Oh, and there's this other fun with MTRRs where each HV decides to do
whatever it wants.

So, we have a zoo brewing on the horizon already!

If there's no clean definition of what each guest is and requires and
that stuff isn't documented properly and if depending on which "feature"
I need to check, I need to call a different function or query
a different variable, then it won't go anywhere as far as guest support
goes.

The cc_platform_has() thing gives us a relatively clean way to abstract
all those differences away and keep the code sane-ish.
Sean Christopherson Feb. 23, 2023, 1:21 a.m. UTC | #36
On Thu, Feb 23, 2023, Borislav Petkov wrote:
> On Wed, Feb 22, 2023 at 02:54:47PM -0800, Sean Christopherson wrote:
> > Why?  I genuinely don't understand the motivation for bundling all of this stuff
> > under a single "feature".
> 
> It is called "sanity".
> 
> See here:
> 
> https://lore.kernel.org/r/Y%2B5immKTXCsjSysx@zn.tnic
> 
> We support SEV, SEV-ES, SEV-SNP, TDX, HyperV... guests and whatever's
> coming down the pipe. And all that goes into arch/x86/ kernel proper
> code.
> 
> The CC_ATTR stuff is clean-ish in the sense that we have separation by
> confidential computing platform - AMD's and Intel's. Hyper-V comes along
> and wants to define a different subset of that. And that's only the
> SEV-SNP side - there's a TDX patchset too.
> 
> And then some other hypervisor will come along and say, but but, I wanna
> have X and Y and a pink pony too.
> 
> Oh, and there's this other fun with MTRRs where each HV decides to do
> whatever it wants.

The MTRR mess isn't unique to coco guests, e.g. KVM explicitly "supports" VMMs
hiding MTTRs from the guest by defaulting to WB if MTTRs aren't exposed to the
guest.  Why on earth Hyper-V suddenly needs to enlighten the guest is beyond me,
but whatever the reason, it's not unique to coco VMs.

> So, we have a zoo brewing on the horizon already!
> 
> If there's no clean definition of what each guest is and requires and
> that stuff isn't documented properly and if depending on which "feature"
> I need to check, I need to call a different function or query
> a different variable, then it won't go anywhere as far as guest support
> goes.
> 
> The cc_platform_has() thing gives us a relatively clean way to abstract
> all those differences away and keep the code sane-ish.

For features that are inherent to the platform, I agree, or at least I don't hate
the interface.  But defining a platform to have specific devices runs counter to
pretty much the entire x86 ecosystem.  At some point, there _will_ be more devices
in private memory than just IO-APIC and TPM, and conversely there will be "platforms"
that support a trusted TPM but not a trusted IO-APIC, and probably even vice versa.

All I'm advocating is that for determining whether or not a device should be mapped
private vs. shared, provide an API so that the hypervisor-specific enlightened code
can manage that insanity without polluting common code.  If we are ever fortunate
enough to have common enumeration, e.g. through ACPI or something, the enlightened
code can simply reroute to the common code.  This is a well established pattern for
many paravirt features, I don't see why it wouldn't work here.
Borislav Petkov Feb. 23, 2023, 10:45 a.m. UTC | #37
On Wed, Feb 22, 2023 at 05:21:27PM -0800, Sean Christopherson wrote:
> The MTRR mess isn't unique to coco guests, e.g. KVM explicitly "supports" VMMs
> hiding MTTRs from the guest by defaulting to WB if MTTRs aren't exposed to the
> guest.  Why on earth Hyper-V suddenly needs to enlighten the guest is beyond me,
> but whatever the reason, it's not unique to coco VMs.

Well, TDX can't stomach MTRRs either, reportedly, and I hear we should
try to avoid #VEs for them too.

And this is the problem: all those guest "enlightening" efforts come up
with the weirdest stuff they need to sprinkle around arch/x86/. And if
we let that without paying attention to the big picture, that will
become an unmaintanable mess.

And I'm not proud of some of the stuff we did in arch/x86/ already and
some day they'll get on my nerves just enough...

> All I'm advocating is that for determining whether or not a device should be mapped
> private vs. shared, provide an API so that the hypervisor-specific enlightened code
> can manage that insanity without polluting common code.  If we are ever fortunate
> enough to have common enumeration, e.g. through ACPI or something, the enlightened
> code can simply reroute to the common code.  This is a well established pattern for
> many paravirt features, I don't see why it wouldn't work here.

Yah, that would be good. If the device can know upfront how it needs to
ioremap its address range, then that is fine - we already have
ioremap_encrypted() for example.

What I don't like is hooking conditionals into the common code to figure
out what to do depending on what we're running on.
Michael Kelley (LINUX) Feb. 23, 2023, 8:01 p.m. UTC | #38
From: Borislav Petkov <bp@alien8.de> Sent: Thursday, February 23, 2023 2:45 AM
> 
> On Wed, Feb 22, 2023 at 05:21:27PM -0800, Sean Christopherson wrote:
> 
> > All I'm advocating is that for determining whether or not a device should be mapped
> > private vs. shared, provide an API so that the hypervisor-specific enlightened code
> > can manage that insanity without polluting common code.  If we are ever fortunate
> > enough to have common enumeration, e.g. through ACPI or something, the enlightened
> > code can simply reroute to the common code.  This is a well established pattern for
> > many paravirt features, I don't see why it wouldn't work here.
> 
> Yah, that would be good.

Just so I'm clear, are you saying you are good with the proposal that Sean
sketched out with code here? [1]   With his proposal, the device driver
is not involved in deciding whether to map encrypted or decrypted.  That
decision is made in the hypervisor-specific callback function based on
the physical address.   The callback has to be made in two places in common
code.  But then as Sean said, " the hypervisor-specific enlightened code
can manage that insanity without polluting common code". :-)

I like Sean's proposal, so if you are good with it, I'll do v6 of the patch
set with that approach.

Dave Hansen:  Are you also OK with Sean's proposal?  Looking for consensus
here ....

> If the device can know upfront how it needs to
> ioremap its address range, then that is fine - we already have
> ioremap_encrypted() for example.
> 

Again, the device driver would not be involved in Sean's proposal.

Michael

[1] https://lore.kernel.org/linux-hyperv/Y+bXjxUtSf71E5SS@google.com/
Dave Hansen Feb. 23, 2023, 8:26 p.m. UTC | #39
On 2/10/23 15:47, Sean Christopherson wrote:
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index a868b76cd3d4..08f65ed439d9 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2682,11 +2682,16 @@ static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
>  {
>         pgprot_t flags = FIXMAP_PAGE_NOCACHE;
>  
> -       /*
> -        * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> -        * bits, just like normal ioremap():
> -        */
> -       flags = pgprot_decrypted(flags);
> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> +               /*
> +               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> +               * bits, just like normal ioremap():
> +               */
> +               if (x86_platform.hyper.is_private_mmio(phys))
> +                       flags = pgprot_encrypted(flags);
> +               else
> +                       flags = pgprot_decrypted(flags);
> +       }

I don't completely hate this.  Thinking to the future, I'd hope that
future platforms will include information about which physical addresses
are shared or private.  This might even vary per device, but this
interface would still work.

I _think_ it would be nicer to wrap both checks up in a helper where the
comments can be more detailed, like:

	if (cc_private_mmio(phys))
		flags = pgprot_encrypted(flags);
	else
		flags = pgprot_decrypted(flags);

but I honestly don't feel that strongly about it.

It does seem a bit odd that there's a new CC_ATTR_GUEST_MEM_ENCRYPT
check wrapping this whole thing.  I guess the trip through
pgprot_decrypted() is harmless on normal platforms, though.

> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 6453fbaedb08..0baec766b921 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -116,6 +116,9 @@ static void __ioremap_check_other(resource_size_t addr, struct ioremap_desc *des
>         if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>                 return;
>  
> +       if (x86_platform.hyper.is_private_mmio(addr))
> +               desc->flags |= IORES_MAP_ENCRYPTED;
> +
>         if (!IS_ENABLED(CONFIG_EFI))
>                 return;
>
Dave Hansen Feb. 23, 2023, 8:27 p.m. UTC | #40
On 2/23/23 12:01, Michael Kelley (LINUX) wrote:
> Dave Hansen:  Are you also OK with Sean's proposal?  Looking for consensus
> here ....

Yeah, I'm generally OK with it as long as Borislav is.
Dave Hansen Feb. 23, 2023, 8:41 p.m. UTC | #41
On 2/23/23 12:26, Dave Hansen wrote:
>> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
>> +               /*
>> +               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
>> +               * bits, just like normal ioremap():
>> +               */
>> +               if (x86_platform.hyper.is_private_mmio(phys))
>> +                       flags = pgprot_encrypted(flags);
>> +               else
>> +                       flags = pgprot_decrypted(flags);
>> +       }
...
> It does seem a bit odd that there's a new CC_ATTR_GUEST_MEM_ENCRYPT
> check wrapping this whole thing.  I guess the trip through
> pgprot_decrypted() is harmless on normal platforms, though.

Yeah, that's _really_ odd.  Sean, were you trying to optimize away the
indirect call or something?

I would just expect the Hyper-V/vTOM code to leave
x86_platform.hyper.is_private_mmio alone unless
it *knows* the platform has private MMIO *and* CC_ATTR_GUEST_MEM_ENCRYPT.

Is there ever a case where CC_ATTR_GUEST_MEM_ENCRYPT==0 and he
Hyper-V/vTOM code would need to set x86_platform.hyper.is_private_mmio?
Michael Kelley (LINUX) Feb. 23, 2023, 8:51 p.m. UTC | #42
From: Dave Hansen <dave.hansen@intel.com> Sent: Thursday, February 23, 2023 12:42 PM
> 
> On 2/23/23 12:26, Dave Hansen wrote:
> >> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> >> +               /*
> >> +               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> >> +               * bits, just like normal ioremap():
> >> +               */
> >> +               if (x86_platform.hyper.is_private_mmio(phys))
> >> +                       flags = pgprot_encrypted(flags);
> >> +               else
> >> +                       flags = pgprot_decrypted(flags);
> >> +       }
> ...
> > It does seem a bit odd that there's a new CC_ATTR_GUEST_MEM_ENCRYPT
> > check wrapping this whole thing.  I guess the trip through
> > pgprot_decrypted() is harmless on normal platforms, though.
> 
> Yeah, that's _really_ odd.  Sean, were you trying to optimize away the
> indirect call or something?
> 
> I would just expect the Hyper-V/vTOM code to leave
> x86_platform.hyper.is_private_mmio alone unless
> it *knows* the platform has private MMIO *and* CC_ATTR_GUEST_MEM_ENCRYPT.

Agreed.

> 
> Is there ever a case where CC_ATTR_GUEST_MEM_ENCRYPT==0 and he
> Hyper-V/vTOM code would need to set x86_platform.hyper.is_private_mmio?

There's no such case. 

I agree that gating with CC_ATTR_GUEST_MEM_ENCRYPT isn't really necessary.
Current upstream code always does the pgprot_decrypted(), and as you said,
that's a no-op on platforms with no memory encryption.

Michael
Sean Christopherson Feb. 23, 2023, 9:07 p.m. UTC | #43
On Thu, Feb 23, 2023, Michael Kelley (LINUX) wrote:
> From: Dave Hansen <dave.hansen@intel.com> Sent: Thursday, February 23, 2023 12:42 PM
> > 
> > On 2/23/23 12:26, Dave Hansen wrote:
> > >> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> > >> +               /*
> > >> +               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> > >> +               * bits, just like normal ioremap():
> > >> +               */
> > >> +               if (x86_platform.hyper.is_private_mmio(phys))
> > >> +                       flags = pgprot_encrypted(flags);
> > >> +               else
> > >> +                       flags = pgprot_decrypted(flags);
> > >> +       }
> > ...
> > > It does seem a bit odd that there's a new CC_ATTR_GUEST_MEM_ENCRYPT
> > > check wrapping this whole thing.  I guess the trip through
> > > pgprot_decrypted() is harmless on normal platforms, though.
> > 
> > Yeah, that's _really_ odd.  Sean, were you trying to optimize away the
> > indirect call or something?

No, my thought was simply to require platforms that support GUEST_MEM_ENCRYPT to
implement x86_platform.hyper.is_private_mmio, e.g. to avoid having to check if
is_private_mmio is NULL, to explicit document that non-Hyper-V encrypted guests
don't (yet) support private MMIO, and to add a bit of documentation around the
{de,en}crypted logic.

> > I would just expect the Hyper-V/vTOM code to leave
> > x86_platform.hyper.is_private_mmio alone unless it *knows* the platform has
> > private MMIO *and* CC_ATTR_GUEST_MEM_ENCRYPT.
> 
> Agreed.
> 
> > 
> > Is there ever a case where CC_ATTR_GUEST_MEM_ENCRYPT==0 and he
> > Hyper-V/vTOM code would need to set x86_platform.hyper.is_private_mmio?
> 
> There's no such case. 
> 
> I agree that gating with CC_ATTR_GUEST_MEM_ENCRYPT isn't really necessary.
> Current upstream code always does the pgprot_decrypted(), and as you said,
> that's a no-op on platforms with no memory encryption.

Right, but since is_private_mmio can be NULL, unless I'm missing something we'll
need an extra check no matter what, i.e. the alternative would be

	if (x86_platform.hyper.is_private_mmio &&
	    x86_platform.hyper.is_private_mmio(phys))
		flags = pgprot_encrypted(flags);
	else
		flags = pgprot_decrypted(flags);

I have no objection to that approach.  It does have the advantage of not needing
an indirect call for encrypted guests that don't support private MMIO, though
I can't imagine this code is performance sensitive.
Michael Kelley (LINUX) Feb. 23, 2023, 9:15 p.m. UTC | #44
From: Sean Christopherson <seanjc@google.com> Sent: Thursday, February 23, 2023 1:08 PM
> 
> On Thu, Feb 23, 2023, Michael Kelley (LINUX) wrote:
> > From: Dave Hansen <dave.hansen@intel.com> Sent: Thursday, February 23, 2023
> 12:42 PM
> > >
> > > On 2/23/23 12:26, Dave Hansen wrote:
> > > >> +       if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> > > >> +               /*
> > > >> +               * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> > > >> +               * bits, just like normal ioremap():
> > > >> +               */
> > > >> +               if (x86_platform.hyper.is_private_mmio(phys))
> > > >> +                       flags = pgprot_encrypted(flags);
> > > >> +               else
> > > >> +                       flags = pgprot_decrypted(flags);
> > > >> +       }
> > > ...
> > > > It does seem a bit odd that there's a new CC_ATTR_GUEST_MEM_ENCRYPT
> > > > check wrapping this whole thing.  I guess the trip through
> > > > pgprot_decrypted() is harmless on normal platforms, though.
> > >
> > > Yeah, that's _really_ odd.  Sean, were you trying to optimize away the
> > > indirect call or something?
> 
> No, my thought was simply to require platforms that support GUEST_MEM_ENCRYPT
> to
> implement x86_platform.hyper.is_private_mmio, e.g. to avoid having to check if
> is_private_mmio is NULL, to explicit document that non-Hyper-V encrypted guests
> don't (yet) support private MMIO, and to add a bit of documentation around the
> {de,en}crypted logic.
> 
> > > I would just expect the Hyper-V/vTOM code to leave
> > > x86_platform.hyper.is_private_mmio alone unless it *knows* the platform has
> > > private MMIO *and* CC_ATTR_GUEST_MEM_ENCRYPT.
> >
> > Agreed.
> >
> > >
> > > Is there ever a case where CC_ATTR_GUEST_MEM_ENCRYPT==0 and he
> > > Hyper-V/vTOM code would need to set x86_platform.hyper.is_private_mmio?
> >
> > There's no such case.
> >
> > I agree that gating with CC_ATTR_GUEST_MEM_ENCRYPT isn't really necessary.
> > Current upstream code always does the pgprot_decrypted(), and as you said,
> > that's a no-op on platforms with no memory encryption.
> 
> Right, but since is_private_mmio can be NULL, unless I'm missing something we'll
> need an extra check no matter what, i.e. the alternative would be
> 
> 	if (x86_platform.hyper.is_private_mmio &&
> 	    x86_platform.hyper.is_private_mmio(phys))
> 		flags = pgprot_encrypted(flags);
> 	else
> 		flags = pgprot_decrypted(flags);
> 
> I have no objection to that approach.  It does have the advantage of not needing
> an indirect call for encrypted guests that don't support private MMIO, though
> I can't imagine this code is performance sensitive.

Or statically set a default stub function for is_private_mmio() that returns "false".
Then there's no need to check for NULL, and only platforms that want to use it
have to code anything.  Several other entries in x86_platform have such defaults.

Michael
Dave Hansen Feb. 23, 2023, 9:24 p.m. UTC | #45
On 2/23/23 13:15, Michael Kelley (LINUX) wrote:
> Or statically set a default stub function for is_private_mmio() that returns "false".
> Then there's no need to check for NULL, and only platforms that want to use it
> have to code anything.  Several other entries in x86_platform have such defaults.

Yeah, that's what I was thinking too, like 'x86_op_int_noop':

> struct x86_platform_ops x86_platform __ro_after_init = {
>         .calibrate_cpu                  = native_calibrate_cpu_early,
>         .calibrate_tsc                  = native_calibrate_tsc,
...
>         .hyper.pin_vcpu                 = x86_op_int_noop,

It's kinda silly to do an indirect call to a two-instruction function,
but this is a pretty slow path.
Borislav Petkov March 6, 2023, 9:51 p.m. UTC | #46
On Thu, Feb 23, 2023 at 12:27:50PM -0800, Dave Hansen wrote:
> On 2/23/23 12:01, Michael Kelley (LINUX) wrote:
> > Dave Hansen:  Are you also OK with Sean's proposal?  Looking for consensus
> > here ....
> 
> Yeah, I'm generally OK with it as long as Borislav is.

Right, I think we're ok with the following basic rules:

- pure arch/x86/ code should use the x86_platform function pointers to
  query hypervisor capabilities/peculiarities

- cc_platform_has() should be used in generic/driver code as it
  abstracts away the underlying platform better. IOW, querying
  x86_platform.... in generic, platform-agnostic driver code looks weird to
  say the least

The hope is that those two should be enough to support most guest types
and not let the zoo get too much out of hand...

Thx.
David Woodhouse March 9, 2023, 11:12 a.m. UTC | #47
On Mon, 2023-03-06 at 22:51 +0100, Borislav Petkov wrote:
> On Thu, Feb 23, 2023 at 12:27:50PM -0800, Dave Hansen wrote:
> > On 2/23/23 12:01, Michael Kelley (LINUX) wrote:
> > > Dave Hansen:  Are you also OK with Sean's proposal?  Looking for consensus
> > > here ....
> > 
> > Yeah, I'm generally OK with it as long as Borislav is.
> 
> Right, I think we're ok with the following basic rules:
> 
> - pure arch/x86/ code should use the x86_platform function pointers to
>   query hypervisor capabilities/peculiarities
> 
> - cc_platform_has() should be used in generic/driver code as it
>   abstracts away the underlying platform better. IOW, querying
>   x86_platform.... in generic, platform-agnostic driver code looks weird to
>   say the least
> 
> The hope is that those two should be enough to support most guest types
> and not let the zoo get too much out of hand...
> 
> Thx.

In 
https://lore.kernel.org/all/20230308171328.1562857-13-usama.arif@bytedance.com/
I added an sev_es_active() helper for x86 code.

Is that consistent with the vision here, or should I do something different?
Borislav Petkov March 9, 2023, 11:59 a.m. UTC | #48
First of all,

thanks for proactively pointing that out instead of simply using what's
there and we get to find out later, only by chance.

Much appreciated. :-)

On Thu, Mar 09, 2023 at 11:12:10AM +0000, David Woodhouse wrote:
> > Right, I think we're ok with the following basic rules:
> >
> > - pure arch/x86/ code should use the x86_platform function pointers to
> >   query hypervisor capabilities/peculiarities
> >
> > - cc_platform_has() should be used in generic/driver code as it
> >   abstracts away the underlying platform better. IOW, querying
> >   x86_platform.... in generic, platform-agnostic driver code looks weird to
> >   say the least
> >
> > The hope is that those two should be enough to support most guest types
> > and not let the zoo get too much out of hand...
> >
> > Thx.
>
> In
> https://lore.kernel.org/all/20230308171328.1562857-13-usama.arif@bytedance.com/
> I added an sev_es_active() helper for x86 code.
>
> Is that consistent with the vision here, or should I do something different?

So looking at sev_es_init_vc_handling() where we set that key, I'm
*thinking* that key can be removed now and the code should check

  cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)

instead.

Because if some of the checks in that function below fail, the guest
will terminate anyway.

Jörg, Tom?
David Woodhouse March 9, 2023, 1 p.m. UTC | #49
On Thu, 2023-03-09 at 12:59 +0100, Borislav Petkov wrote:
> 
> So looking at sev_es_init_vc_handling() where we set that key, I'm
> *thinking* that key can be removed now and the code should check
> 
>   cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)
> 
> instead.
> 
> Because if some of the checks in that function below fail, the guest
> will terminate anyway.
> 
> Jörg, Tom?

Hrm... the implication of that is that I should do something like this
in my own code. Is this really your intent?

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b4265c5b46da..7ac4ec6415de 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1519,24 +1519,17 @@ void __init smp_prepare_cpus_common(void)
  */
 static bool prepare_parallel_bringup(void)
 {
-	bool has_sev_es = sev_es_active();
+	/*
+	 * The "generic" CC_ATTR_GUEST_STATE_ENCRYPT actually means specifically
+	 * SEV-ES, and only SEV-ES, and always shall mean that. If it's present,
+	 * that means the AP startup code should use the hard-coded SEV-ES GHCB
+	 * call to find its APIC ID (STARTUP_APICID_SEV_ES).
+	 */
+	bool has_sev_es = cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT);
 
 	if (IS_ENABLED(CONFIG_X86_32))
 		return false;
 
-	/*
-	 * Encrypted guests other than SEV-ES (in the future) will need to
-	 * implement an early way of finding the APIC ID, since they will
-	 * presumably block direct CPUID too. Be kind to our future selves
-	 * by warning here instead of just letting them break. Parallel
-	 * startup doesn't have to be in the first round of enabling patches
-	 * for any such technology.
-	 */
-	if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT) && !has_sev_es) {
-		pr_info("Disabling parallel bringup due to guest memory encryption\n");
-		return false;
-	}
-
 	if (x2apic_mode || has_sev_es) {
 		if (boot_cpu_data.cpuid_level < 0x0b)
 			return false;
Tom Lendacky March 9, 2023, 2:19 p.m. UTC | #50
On 3/9/23 05:59, Borislav Petkov wrote:
> First of all,
> 
> thanks for proactively pointing that out instead of simply using what's
> there and we get to find out later, only by chance.
> 
> Much appreciated. :-)
> 
> On Thu, Mar 09, 2023 at 11:12:10AM +0000, David Woodhouse wrote:
>>> Right, I think we're ok with the following basic rules:
>>>
>>> - pure arch/x86/ code should use the x86_platform function pointers to
>>>    query hypervisor capabilities/peculiarities
>>>
>>> - cc_platform_has() should be used in generic/driver code as it
>>>    abstracts away the underlying platform better. IOW, querying
>>>    x86_platform.... in generic, platform-agnostic driver code looks weird to
>>>    say the least
>>>
>>> The hope is that those two should be enough to support most guest types
>>> and not let the zoo get too much out of hand...
>>>
>>> Thx.
>>
>> In
>> https://lore.kernel.org/all/20230308171328.1562857-13-usama.arif@bytedance.com/
>> I added an sev_es_active() helper for x86 code.
>>
>> Is that consistent with the vision here, or should I do something different?
> 
> So looking at sev_es_init_vc_handling() where we set that key, I'm
> *thinking* that key can be removed now and the code should check
> 
>    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)
> 
> instead.
> 
> Because if some of the checks in that function below fail, the guest
> will terminate anyway.
> 
> Jörg, Tom?

I believe Joerg added that key for performance reasons, since it is used 
on the exception path and can avoid all the calls to cc_platform_has(). I 
think that key should stay.

Maybe David can introduce an CC_ATTR_GUEST_SEV_ES attribute that returns 
true if the guest is an ES or SNP guest. Or do we introduce a 
CC_ATTR_PARALLEL_BOOT attribute that returns true for any SEV guest.

Then the "if cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT) && !has_sev_es" 
check in arch/x86/kernel/smpboot.c can be removed and the following check 
can become if (x2apic_mode || cc_platform_has(CC_ATTR_PARALLEL_BOOT))

Not sure how that affects a TDX guest, though.

Thanks,
Tom

>
Joerg Roedel March 9, 2023, 2:36 p.m. UTC | #51
On Thu, Mar 09, 2023 at 08:19:58AM -0600, Tom Lendacky wrote:
> I believe Joerg added that key for performance reasons, since it is used on
> the exception path and can avoid all the calls to cc_platform_has(). I think
> that key should stay.

Yes, that is right. The key is mainly for the NMI entry path which can
be performance relevant in some situations. For SEV-ES some special
handling is needed there to re-enable NMIs and adjust the #VC stack in
case it was raised on the VC-handlers entry path.

Regards,

	Joerg
Borislav Petkov March 9, 2023, 2:45 p.m. UTC | #52
On Thu, Mar 09, 2023 at 03:36:45PM +0100, Jörg Rödel wrote:
> Yes, that is right. The key is mainly for the NMI entry path which can
> be performance relevant in some situations. For SEV-ES some special
> handling is needed there to re-enable NMIs and adjust the #VC stack in
> case it was raised on the VC-handlers entry path.

So the performance argument is meh. That key will be replaced by

	if (cc_vendor == CC_VENDOR_AMD &&
	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)

which is something like 4 insns or so. Tops.

Haven't looked yet but it should be cheap.
David Woodhouse March 9, 2023, 3:45 p.m. UTC | #53
On Thu, 2023-03-09 at 15:45 +0100, Borislav Petkov wrote:
> On Thu, Mar 09, 2023 at 03:36:45PM +0100, Jörg Rödel wrote:
> > Yes, that is right. The key is mainly for the NMI entry path which can
> > be performance relevant in some situations. For SEV-ES some special
> > handling is needed there to re-enable NMIs and adjust the #VC stack in
> > case it was raised on the VC-handlers entry path.
> 
> So the performance argument is meh. That key will be replaced by
> 
>         if (cc_vendor == CC_VENDOR_AMD &&
>             cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)
> 
> which is something like 4 insns or so. Tops.
> 
> Haven't looked yet but it should be cheap.


cc_vendor isn't yet exposed. As we discussed this in IRC, I've been
updating the parallel bringup support for SEV-ES, including adding a
cc_get_vendor() function, in the top of my tree at
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/parallel-6.2-v15
and it now looks like this:

	/*
	 * Encrypted guests other than SEV-ES (in the future) will need to
	 * implement an early way of finding the APIC ID, since they will
	 * presumably block direct CPUID too. Be kind to our future selves
	 * by warning here instead of just letting them break. Parallel
	 * startup doesn't have to be in the first round of enabling patches
	 * for any such technology.
	 */
	if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
		switch (cc_get_vendor()) {
		case CC_VENDOR_AMD:
			has_sev_es = true;
			break;

		default:
			pr_info("Disabling parallel bringup due to guest state encryption\n");
			return false;
		}
	}

Using an explicit CC_ATTR_NO_EARLY_CPUID flag instead of
CC_ATTR_GUEST_STATE_ENCRYPT which is merely an approximation of that,
might be interesting.
Borislav Petkov March 9, 2023, 4:34 p.m. UTC | #54
On Thu, Mar 09, 2023 at 03:45:35PM +0000, David Woodhouse wrote:
> cc_vendor isn't yet exposed. As we discussed this in IRC, I've been

Right, and we might just as well expose it because having
a setter/getter is kinda silly for a __ro_after_init variable, see
below.

So here's what I was able to scratch up quickly to remove the static
key.

The asm looks like this:

# ./arch/x86/include/asm/sev.h:156: 	if (cc_vendor == CC_VENDOR_AMD &&
	cmpl	$1, cc_vendor(%rip)	#, cc_vendor
	je	.L204	#,

...

.L204:
# ./arch/x86/include/asm/sev.h:157:         cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
        movl    $3, %edi        #,
        call    cc_platform_has #
# ./arch/x86/include/asm/sev.h:156:     if (cc_vendor == CC_VENDOR_AMD &&
        testb   %al, %al        # tmp134
        je      .L158   #,

and so I doubt that this is at all measureable comparing that to the
rest of the code that gets executed in the NMI handler. And it lets us
get rid of yet another static key and use only the CC APIs.

Oh, and it bitches because cc_platform_has() is being called in noinstr
region but we can mark it noinstr or add the drop-noinstr markers around
it, if needed. Not that important as that function and what it calls
don't do anything magical.

Thx.

---
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 49b44f881484..34446383e68b 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -13,7 +13,7 @@
 #include <asm/coco.h>
 #include <asm/processor.h>
 
-static enum cc_vendor vendor __ro_after_init;
+enum cc_vendor cc_vendor __ro_after_init;
 static u64 cc_mask __ro_after_init;
 
 static bool intel_cc_platform_has(enum cc_attr attr)
@@ -83,7 +83,7 @@ static bool hyperv_cc_platform_has(enum cc_attr attr)
 
 bool cc_platform_has(enum cc_attr attr)
 {
-	switch (vendor) {
+	switch (cc_vendor) {
 	case CC_VENDOR_AMD:
 		return amd_cc_platform_has(attr);
 	case CC_VENDOR_INTEL:
@@ -105,7 +105,7 @@ u64 cc_mkenc(u64 val)
 	 * - for AMD, bit *set* means the page is encrypted
 	 * - for Intel *clear* means encrypted.
 	 */
-	switch (vendor) {
+	switch (cc_vendor) {
 	case CC_VENDOR_AMD:
 		return val | cc_mask;
 	case CC_VENDOR_INTEL:
@@ -118,7 +118,7 @@ u64 cc_mkenc(u64 val)
 u64 cc_mkdec(u64 val)
 {
 	/* See comment in cc_mkenc() */
-	switch (vendor) {
+	switch (cc_vendor) {
 	case CC_VENDOR_AMD:
 		return val & ~cc_mask;
 	case CC_VENDOR_INTEL:
@@ -131,7 +131,7 @@ EXPORT_SYMBOL_GPL(cc_mkdec);
 
 __init void cc_set_vendor(enum cc_vendor v)
 {
-	vendor = v;
+	cc_vendor = v;
 }
 
 __init void cc_set_mask(u64 mask)
diff --git a/arch/x86/include/asm/coco.h b/arch/x86/include/asm/coco.h
index 3d98c3a60d34..0563e07a1002 100644
--- a/arch/x86/include/asm/coco.h
+++ b/arch/x86/include/asm/coco.h
@@ -11,6 +11,8 @@ enum cc_vendor {
 	CC_VENDOR_INTEL,
 };
 
+extern enum cc_vendor cc_vendor;
+
 void cc_set_vendor(enum cc_vendor v);
 void cc_set_mask(u64 mask);
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ebc271bb6d8e..1335781e4976 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -12,6 +12,7 @@
 #include <asm/insn.h>
 #include <asm/sev-common.h>
 #include <asm/bootparam.h>
+#include <asm/coco.h>
 
 #define GHCB_PROTOCOL_MIN	1ULL
 #define GHCB_PROTOCOL_MAX	2ULL
@@ -134,24 +135,26 @@ struct snp_secrets_page_layout {
 } __packed;
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
-extern struct static_key_false sev_es_enable_key;
 extern void __sev_es_ist_enter(struct pt_regs *regs);
 extern void __sev_es_ist_exit(void);
 static __always_inline void sev_es_ist_enter(struct pt_regs *regs)
 {
-	if (static_branch_unlikely(&sev_es_enable_key))
+	if (cc_vendor == CC_VENDOR_AMD &&
+	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
 		__sev_es_ist_enter(regs);
 }
 static __always_inline void sev_es_ist_exit(void)
 {
-	if (static_branch_unlikely(&sev_es_enable_key))
+	if (cc_vendor == CC_VENDOR_AMD &&
+	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
 		__sev_es_ist_exit();
 }
 extern int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
 extern void __sev_es_nmi_complete(void);
 static __always_inline void sev_es_nmi_complete(void)
 {
-	if (static_branch_unlikely(&sev_es_enable_key))
+	if (cc_vendor == CC_VENDOR_AMD &&
+	    cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
 		__sev_es_nmi_complete();
 }
 extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 679026a640ef..7d873bffbd8e 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -111,7 +111,6 @@ struct ghcb_state {
 };
 
 static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
-DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
 
 static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
 
@@ -1393,9 +1392,6 @@ void __init sev_es_init_vc_handling(void)
 			sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
 	}
 
-	/* Enable SEV-ES special handling */
-	static_branch_enable(&sev_es_enable_key);
-
 	/* Initialize per-cpu GHCB pages */
 	for_each_possible_cpu(cpu) {
 		alloc_runtime_data(cpu);
David Woodhouse March 10, 2023, 10:05 a.m. UTC | #55
On Thu, 2023-03-09 at 17:34 +0100, Borislav Petkov wrote:
> On Thu, Mar 09, 2023 at 03:45:35PM +0000, David Woodhouse wrote:
> > cc_vendor isn't yet exposed. As we discussed this in IRC, I've been
> 
> Right, and we might just as well expose it because having
> a setter/getter is kinda silly for a __ro_after_init variable, see
> below.
> 
> So here's what I was able to scratch up quickly to remove the static
> key.
> 
> The asm looks like this:

Seems reasonable. I might just let you iterate on that, drop the SNP-ES
part from the parallel boot series, then let Tom or someone send it in
later once the dust settles.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index e902564..72eb366 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -169,6 +169,8 @@  static inline unsigned int isa_virt_to_bus(volatile void *address)
 }
 #define isa_bus_to_virt		phys_to_virt
 
+extern void ioremap_set_encrypted_range(resource_size_t addr, unsigned long size);
+
 /*
  * The default ioremap() behavior is non-cached; if you need something
  * else, you probably want one of the following.
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 6453fba..8db5846 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -37,6 +37,10 @@  struct ioremap_desc {
 	unsigned int flags;
 };
 
+/* Range of "other" addresses to treat as encrypted when remapping */
+resource_size_t other_encrypted_start;
+resource_size_t other_encrypted_end;
+
 /*
  * Fix up the linear direct mapping of the kernel to avoid cache attribute
  * conflicts.
@@ -108,14 +112,35 @@  static unsigned int __ioremap_check_encrypted(struct resource *res)
 }
 
 /*
+ * Allow a hypervisor to specify an additional range of addresses to
+ * treat as encrypted when remapping.
+ */
+void ioremap_set_encrypted_range(resource_size_t addr, unsigned long size)
+{
+	other_encrypted_start = addr;
+	other_encrypted_end = addr + size - 1;
+}
+
+/*
  * The EFI runtime services data area is not covered by walk_mem_res(), but must
- * be mapped encrypted when SEV is active.
+ * be mapped encrypted when SEV is active. Also check the hypervisor specified
+ * "other" address range to treat as encrypted.
  */
 static void __ioremap_check_other(resource_size_t addr, struct ioremap_desc *desc)
 {
 	if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
 		return;
 
+	/*
+	 * Check for an address within the "other" encrypted address range. If such
+	 * a range is set, it must include the entire space used by the device,
+	 * so we don't need to deal with a partial fit.
+	 */
+	if ((addr >= other_encrypted_start) && (addr <= other_encrypted_end)) {
+		desc->flags |= IORES_MAP_ENCRYPTED;
+		return;
+	}
+
 	if (!IS_ENABLED(CONFIG_EFI))
 		return;