Message ID: 20230320191956.1354602-1-jpiotrowski@linux.microsoft.com
Series: Support ACPI PSP on Hyper-V
On Mon, Mar 20, 2023 at 07:19:48PM +0000, Jeremi Piotrowski wrote:
> This series is a prerequisite for nested SNP-host support on Hyper-V
I'm curious: what in the *world* is a sensible use case for doing this
thing at all?
On 22/03/2023 16:46, Borislav Petkov wrote:
> On Mon, Mar 20, 2023 at 07:19:48PM +0000, Jeremi Piotrowski wrote:
>> This series is a prerequisite for nested SNP-host support on Hyper-V
>
> I'm curious: what in the *world* is a sensible use case for doing this
> thing at all?

This is actually not as crazy as it sounds. What this does is allow a
normal (non-SNP) VM to host confidential (SNP) VMs. I say "normal", but
not every VM is going to be able to do this: it needs to be running on
AMD hardware, be configured with access to VirtualizationExtensions and
a "HardwareIsolation" capability, and be given a number of "hardware
isolated guests" that it is allowed to spawn. In practice this results
in the VM seeing a PSP device and SEV-SNP related CPUID leaves, and
having access to additional memory management instructions
(rmpadjust/psmash). This allows the rest of the KVM-SNP support to
work.

So instead of taking a bare-metal AMD server with 128 CPUs to run
confidential workloads, you'll be able to provision an Azure VM with,
say, 8 CPUs and run up to 8 SNP guests nested inside it.

It's also useful for development: I participate in the kata-containers
project, where we're doing confidential-containers related work, and
having access to test VMs that can run SNP guests is going to make
things much easier.

If you're interested, I posted the other half of the patches required
some time back:

https://lore.kernel.org/lkml/20230213103402.1189285-1-jpiotrowski@linux.microsoft.com/#t

Jeremi
On Wed, Mar 22, 2023 at 06:33:37PM +0100, Jeremi Piotrowski wrote:
> What this does is allow a normal (non-SNP) VM to host confidential (SNP)
> VMs. I say "normal", but not every VM is going to be able to do this: it needs

If you say "non-SNP" VM, then this sounds like it is purely for
development, because I cannot see how you're going to give the
confidentiality guarantee to the SNP guests if the lower level is
unencrypted, non-SNP and so on...

> to be running on AMD hardware, be configured with access to
> VirtualizationExtensions and a "HardwareIsolation" capability, and be
> given a number of "hardware isolated guests" that it is allowed to
> spawn. In practice this results in the VM seeing a PSP device and
> SEV-SNP related CPUID leaves, and having access to additional memory
> management instructions (rmpadjust/psmash). This allows the rest of
> the KVM-SNP support to work.

So why don't you emulate the PSP in KVM instead of doing some BIOS hack?
And multiplex the access to it between all the parties needing it?
On 3/22/2023 7:15 PM, Borislav Petkov wrote:
> On Wed, Mar 22, 2023 at 06:33:37PM +0100, Jeremi Piotrowski wrote:
>> What this does is allow a normal (non-SNP) VM to host confidential (SNP)
>> VMs. I say "normal", but not every VM is going to be able to do this: it needs
>
> If you say "non-SNP" VM, then this sounds like it is purely for
> development, because I cannot see how you're going to give the
> confidentiality guarantee to the SNP guests if the lower level is
> unencrypted, non-SNP and so on...

Not at all. Just to be clear: this lights up all the same bits of SNP
as it does on bare metal; none of it is emulated away. On bare metal
the hypervisor underneath the SNP guest is unencrypted as well. Here
the stack is: L0 (Hyper-V), L1 (KVM) and L2 (SNP guest).

Starting an SNP guest is the same and involves sending commands to the
PSP:
* SNP_GCTX_CREATE
* SNP_LAUNCH_START
* SNP_LAUNCH_UPDATE
* SNP_LAUNCH_FINISH

Pages need to be assigned to a specific L2 SNP guest in the system-wide
"reverse map table", at which point neither the L0 nor the L1
hypervisor can touch them. Every L2 SNP guest's memory is encrypted
with a different key, and the SNP guest can fetch a hardware-signed
attestation report from the PSP that includes a hash of all the pages
that were loaded (and encrypted) into the VM address space at the time
the VM was launched. The communication channel between the L2 guest and
the PSP is secured using keys that the PSP injects into the SNP guest's
address space at launch time.

Honestly, I find it pretty cool that you can stuff a whole extra
hypervisor underneath the SNP guest, and the hardware will still ensure
and attest to the fact that neither hypervisor is able to compromise
the integrity and confidentiality of the VM enclave. And you can verify
this claim independently.

>> to be running on AMD hardware, be configured with access to
>> VirtualizationExtensions and a "HardwareIsolation" capability, and be
>> given a number of "hardware isolated guests" that it is allowed to
>> spawn. In practice this results in the VM seeing a PSP device and
>> SEV-SNP related CPUID leaves, and having access to additional memory
>> management instructions (rmpadjust/psmash). This allows the rest of
>> the KVM-SNP support to work.
>
> So why don't you emulate the PSP in KVM instead of doing some BIOS hack?
> And multiplex the access to it between all the parties needing it?

Not sure I follow you here. The quoted paragraph talks about what the
L1 VM (KVM) sees. The L1 VM needs to issue PSP commands to bring up an
L2 SNP guest, and later the L1 VM relays SNP guest commands to the PSP.
The PSP commands are multiplexed to the physical PSP by the L0
hypervisor (Hyper-V).

So Hyper-V exposes a PSP to the L1 VM because it is needed, and it is
compatible with the existing Linux driver that handles the PSP. The way
it is exposed (ACPI table) follows how it was specified by AMD.
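For reference, from the L1 host kernel's side the launch flow above
boils down to a short sequence of PSP mailbox commands. A minimal
sketch, assuming the sev_do_cmd() dispatcher and the sev_data_snp_*
request structures from AMD's proposed SNP host patches
(drivers/crypto/ccp); the helper itself is illustrative, not the actual
KVM-SNP code:

/*
 * Illustrative only: SEV_CMD_SNP_* and the request structs follow
 * AMD's proposed SNP host patches; this helper and its arguments are
 * a sketch, not the real KVM-SNP code.
 */
static int snp_launch_sketch(struct sev_data_snp_addr *gctx,
			     struct sev_data_snp_launch_start *start,
			     struct sev_data_snp_launch_update *update,
			     struct sev_data_snp_launch_finish *finish)
{
	int rc, fw_err;

	/* Ask the PSP to create a guest context page. */
	rc = sev_do_cmd(SEV_CMD_SNP_GCTX_CREATE, gctx, &fw_err);
	if (rc)
		return rc;

	/* Start the launch: the PSP derives the guest's memory key. */
	rc = sev_do_cmd(SEV_CMD_SNP_LAUNCH_START, start, &fw_err);
	if (rc)
		return rc;

	/* Encrypt and measure the initial guest image. */
	rc = sev_do_cmd(SEV_CMD_SNP_LAUNCH_UPDATE, update, &fw_err);
	if (rc)
		return rc;

	/* Finalize: the measurement becomes fixed and attestable. */
	return sev_do_cmd(SEV_CMD_SNP_LAUNCH_FINISH, finish, &fw_err);
}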
On Thu, Mar 23, 2023 at 03:46:22PM +0100, Jeremi Piotrowski wrote:
> Not at all. Just to be clear: this lights up all the same bits of SNP
> as it does on bare metal; none of it is emulated away. On bare metal
> the hypervisor underneath the SNP guest is unencrypted as well. Here
> the stack is: L0 (Hyper-V), L1 (KVM) and L2 (SNP guest).

Yeah, I talked to folks after sending that email yesterday. Apparently
it is ok to do that without compromising SNP guest security, but I, in
my eternal paranoia, somehow don't have the warm and fuzzy feeling
about it.

> ... The communication channel between the L2 guest and the PSP is
> secured using keys that the PSP injects into the SNP guest's address
> space at launch time.

Yeah, all the levels below L2 are required to set up the environment
properly so that L2 SNP guests can run.

> Honestly, I find it pretty cool that you can stuff a whole extra
> hypervisor underneath the SNP guest,

Whatever floats your boat. :-)

As long as it doesn't mess up my interrupt setup code with crazy hacks.

> Not sure I follow you here. The quoted paragraph talks about what the
> L1 VM (KVM) sees. The L1 VM needs to issue PSP commands to bring up an
> L2 SNP guest, and later the L1 VM relays SNP guest commands to the PSP.
> The PSP commands are multiplexed to the physical PSP by the L0
> hypervisor (Hyper-V).
>
> So Hyper-V exposes a PSP to the L1 VM because it is needed, and it is
> compatible with the existing Linux driver that handles the PSP. The way
> it is exposed (ACPI table) follows how it was specified by AMD.

No no, it was specified by Microsoft architects.

So, that same interface to the PSP can be done by L0 emulating a
standard ACPI device for the KVM L1 HV, and then L1 can use the normal
ACPI interrupt #9. What's the need for supplying all that other gunk
like destination ID, interrupt vector and so on?

Thx.
On 3/23/2023 4:23 PM, Borislav Petkov wrote:
> On Thu, Mar 23, 2023 at 03:46:22PM +0100, Jeremi Piotrowski wrote:
>> Not at all. Just to be clear: this lights up all the same bits of SNP
>> as it does on bare metal; none of it is emulated away.
>
> Yeah, I talked to folks after sending that email yesterday. Apparently
> it is ok to do that without compromising SNP guest security, but I, in
> my eternal paranoia, somehow don't have the warm and fuzzy feeling
> about it.
>
>> Honestly, I find it pretty cool that you can stuff a whole extra
>> hypervisor underneath the SNP guest,
>
> Whatever floats your boat. :-)
>
> As long as it doesn't mess up my interrupt setup code with crazy hacks.
>
>> So Hyper-V exposes a PSP to the L1 VM because it is needed, and it is
>> compatible with the existing Linux driver that handles the PSP. The way
>> it is exposed (ACPI table) follows how it was specified by AMD.
>
> No no, it was specified by Microsoft architects.
>
> So, that same interface to the PSP can be done by L0 emulating a
> standard ACPI device for the KVM L1 HV, and then L1 can use the normal
> ACPI interrupt #9.

That same interface is exposed by physical hardware+firmware to the
underlying Hyper-V. So it wasn't a matter of Microsoft architects
coming up with a guest-host interface, but rather of exposing the
virtual hardware in the same way as on a physical server.

> What's the need for supplying all that other gunk like destination ID,
> interrupt vector and so on?

I'm not sure what drove the design decisions that led to the interface
looking the way it does. What I can do is put in the work to map it
into kernel constructs in the most native way possible, and in a way
that doesn't look or feel like a crazy hack.

> Thx.
On Thu, Mar 23, 2023 at 05:11:26PM +0100, Jeremi Piotrowski wrote:
> That same interface is exposed by physical hardware+firmware to the
> underlying Hyper-V.

Let me see if I understand it correctly: Hyper-V *bare metal* is using
the same ASPT spec to talk to the *physical* PSP device?

Is that ASPT interface to talk to the PSP used by the L0 hypervisor?

Or does the L0 HV have a normal driver, similar to the Linux one,
without the functionality this ASPT spec provides?

> So it wasn't a matter of Microsoft architects coming up with a
> guest-host interface, but rather of exposing the virtual hardware in
> the same way as on a physical server.

So if you want to expose the same interface to the L1 guest, why isn't
Hyper-V emulating an ACPI device just like any other functionality? Why
does it need to reach into the interrupt handling internals?

I'd expect that the L0 HV would emulate a PSP device, the L1 would
simply load the Linux PSP device driver and everything should just
work.

What's the point of that alternate access at all?

But I might still be missing something...
On 3/23/2023 5:34 PM, Borislav Petkov wrote:
> On Thu, Mar 23, 2023 at 05:11:26PM +0100, Jeremi Piotrowski wrote:
>> That same interface is exposed by physical hardware+firmware to the
>> underlying Hyper-V.
>
> Let me see if I understand it correctly: Hyper-V *bare metal* is using
> the same ASPT spec to talk to the *physical* PSP device?

Yes.

> Is that ASPT interface to talk to the PSP used by the L0 hypervisor?

Yes (unless I am mistaken, this is the same statement as above).

> Or does the L0 HV have a normal driver, similar to the Linux one,
> without the functionality this ASPT spec provides?

The L0 HV relies on the ASPT spec/interface to map registers and set up
interrupts, and then uses a protocol driver to handle the PSP command
set (like the Linux one).

>> So it wasn't a matter of Microsoft architects coming up with a
>> guest-host interface, but rather of exposing the virtual hardware in
>> the same way as on a physical server.
>
> So if you want to expose the same interface to the L1 guest, why isn't
> Hyper-V emulating an ACPI device just like any other functionality? Why
> does it need to reach into the interrupt handling internals?

The primary stack for nested SNP support is Hyper-V-on-Hyper-V. By
exposing the PSP device to the L1 guest in the same way (as to the L0),
everything can be done in exactly the same way as on bare metal. I just
really want nested SNP support to work in KVM-on-Hyper-V as well, so
that's why I'm adding support for these things.

Also: if Linux were to run bare-metal on that hardware, it would need
to be able to handle the PSP through the ASPT interface as well.

> I'd expect that the L0 HV would emulate a PSP device, the L1 would
> simply load the Linux PSP device driver and everything should just
> work.
>
> What's the point of that alternate access at all?

It's actually great that you made me ask around, because I learned
something that will help: since the AMD PSP is a privileged device,
there is a desire to not have to trust the ACPI stack, and instead rely
fully on static ACPI tables for discovery and configuration. This also
applies to the AMD IOMMU. If you look at iommu_setup_intcapxt() in
drivers/iommu/amd/init.c, it does exactly the things that are needed to
set up the PSP interrupt too. Here's a link to the patch that added
that:

https://lore.kernel.org/all/20201111144322.1659970-3-dwmw2@infradead.org/#t

So my plan now is to post a v4 with proper irq_domain handling. Ok,
Thomas?

Best wishes,
Jeremi
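The "static tables only" point reduces, in code, to a plain table
lookup that never touches AML or the ACPI interpreter. A minimal sketch
using the stock acpi_get_table() API; struct aspt_table and its fields
are hypothetical stand-ins for the real acpi_parse_aspt() from the
series:

#include <linux/acpi.h>

/* Hypothetical layout: the real one is defined by AMD's ASPT spec. */
struct aspt_table {
	struct acpi_table_header header;
	u64 reg_base;		/* assumed: PSP register window base */
	u32 reg_len;		/* assumed: window length */
	/* per-register offsets would follow */
};

static int __init psp_discover_via_aspt(void)
{
	struct acpi_table_header *hdr;
	struct aspt_table *aspt;
	acpi_status status;

	/* Pure static-table lookup: no ACPI interpreter involved. */
	status = acpi_get_table("ASPT", 0, &hdr);
	if (ACPI_FAILURE(status))
		return -ENODEV;

	aspt = (struct aspt_table *)hdr;
	pr_info("ASPT: PSP registers at %#llx (%u bytes)\n",
		(unsigned long long)aspt->reg_base, aspt->reg_len);

	acpi_put_table(hdr);
	return 0;
}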
On Fri, Mar 24, 2023 at 06:10:09PM +0100, Jeremi Piotrowski wrote:
> Since the AMD PSP is a privileged device, there is a desire to not have
> to trust the ACPI stack,

And yet you do:

+ err = acpi_parse_aspt(&res[0], &pdata);
+ if (err)
+         return err;

You don't trust the ACPI stack, and yet you're parsing an ACPI table?!?!
You have to make up your mind here.

Btw, you still haven't answered my question about doing:

devm_request_irq(dev, 9, ..)

where 9 is the default ACPI interrupt.

You can have some silly table tell you what to map, or you can simply
map IRQ 9 and be done with it. In this second case you can *really* not
trust ACPI because you know which IRQ it is.
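For what it's worth, the suggested alternative amounts to something
like the sketch below; psp_irq_handler and the platform device are
placeholders, and IRQ 9 is the legacy ACPI SCI line, which is shared:

#include <linux/interrupt.h>
#include <linux/platform_device.h>

static irqreturn_t psp_irq_handler(int irq, void *data)
{
	/* ack the PSP and dispatch command-completion work here */
	return IRQ_HANDLED;
}

static int psp_request_fixed_irq(struct platform_device *pdev)
{
	/* 9 == the legacy ACPI SCI; level-triggered and shared. */
	return devm_request_irq(&pdev->dev, 9, psp_irq_handler,
				IRQF_SHARED, "psp", pdev);
}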
On Sun, Apr 02 2023 at 17:44, Borislav Petkov wrote:
> On Fri, Mar 24, 2023 at 06:10:09PM +0100, Jeremi Piotrowski wrote:
>> Since the AMD PSP is a privileged device, there is a desire to not have
>> to trust the ACPI stack,
>
> And yet you do:
>
> + err = acpi_parse_aspt(&res[0], &pdata);
> + if (err)
> +         return err;
>
> You don't trust the ACPI stack, and yet you're parsing an ACPI table?!?!
> You have to make up your mind here.
>
> Btw, you still haven't answered my question about doing:
>
> devm_request_irq(dev, 9, ..)
>
> where 9 is the default ACPI interrupt.
>
> You can have some silly table tell you what to map, or you can simply
> map IRQ 9 and be done with it. In this second case you can *really* not
> trust ACPI because you know which IRQ it is.

The real problem here is that the information provided about the
overall design and requirements is close to zero. All we have heard so
far is hand-waving about not trusting PCI and ACPI.

Jeremi, can you please describe exactly what the design and constraints
are, in understandable and coherent sentences?

Thanks,

        tglx
On 4/3/2023 8:20 AM, Thomas Gleixner wrote:
> On Sun, Apr 02 2023 at 17:44, Borislav Petkov wrote:
>> On Fri, Mar 24, 2023 at 06:10:09PM +0100, Jeremi Piotrowski wrote:
>>> Since the AMD PSP is a privileged device, there is a desire to not have
>>> to trust the ACPI stack,
>>
>> And yet you do:
>>
>> + err = acpi_parse_aspt(&res[0], &pdata);
>> + if (err)
>> +         return err;
>>
>> You don't trust the ACPI stack, and yet you're parsing an ACPI table?!?!
>> You have to make up your mind here.
>>
>> Btw, you still haven't answered my question about doing:
>>
>> devm_request_irq(dev, 9, ..)
>>
>> where 9 is the default ACPI interrupt.
>>
>> You can have some silly table tell you what to map, or you can simply
>> map IRQ 9 and be done with it. In this second case you can *really* not
>> trust ACPI because you know which IRQ it is.

Will respond to this mail directly.

> The real problem here is that the information provided about the
> overall design and requirements is close to zero. All we have heard so
> far is hand-waving about not trusting PCI and ACPI.

That's not a fair characterization, Thomas, but I will turn the other
cheek.

> Jeremi, can you please describe exactly what the design and constraints
> are, in understandable and coherent sentences?

Here goes; I will keep it as simple as I can.

The goal of these patches is to operate all the hardware interfaces
required to run AMD SEV-SNP VMs, but in the context of a Linux VM
running on top of Hyper-V. This Linux VM is called the SNP-host VM. All
the patches I submit target the SNP-host VM kernel, which uses KVM to
bring up SEV-SNP VMs. To get SEV-SNP working, you need to combine this
work with AMD's KVM SEV-SNP patches. I posted two patch sets: one that
extends AMD's patches, and one that is independent of them (this one
here) and could be merged sooner.

Here are the design constraints:

1. The interfaces exposed to the SNP-host VM to operate SEV-SNP match
   real hardware interface specifications defined by AMD. This is
   because we are emulating/virtualizing a hardware feature, not some
   made-up virtual thing.

2. The SNP-host VM may run either Windows (Hyper-V) or Linux, so the
   SEV-SNP interfaces need to be supported by both.

3. Hyper-V Generation 2 VMs do not have a PCI bus. The SNP-host VM
   must be a Hyper-V Gen 2 VM.

One of the components needed to operate SEV-SNP is the Platform
Security Processor (PSP), aka AMD Secure Processor (ASP). The PSP is
the root-of-trust on AMD systems. The PSP is specified as being
discoverable either on the PCI bus, or through the presence of an ACPI
table with the "ASPT" (AMD Secure Processor Table) signature.

Here goes the design:

Constraint 1 means that only the two specified ways of discovering and
configuring a PSP inside the SNP-host VM were in the running: PCI or
ASPT. Constraint 3 means that the PCI version of the PSP is not a
viable option. Additionally, the ASPT is used on AMD hardware in
Microsoft datacenters, which means it is supported in Hyper-V
(constraint 2). The outcome is that the SNP-host VM sees an ASPT.

The ASPT provides the following information: the memory range of the
PSP registers and the offsets of the individual PSP registers inside
that memory range. There are 7 registers:
- 6 are related to the "command submission" portion of the PSP; the
  ccp module knows how to operate those.
- the last one, the "ACPI CmdResp" register, is used to configure the
  PSP interrupt to the OS.
The PSP interrupt configuration through the "ACPI CmdResp" register
takes the following information:
- APIC ID
- interrupt vector
- destination mode (physical/logical)
- message type (fixed/lowest priority)

So to hook this up with the Linux device model, I wrote patches that do
the following: detect the ASPT table, extract the information and
register a "psp" platform_device for the "ccp" module to bind to;
create an irq_domain and encapsulate dealing with the PSP interrupt
register there, so that the "ccp" module has an irq number that it can
pass to request_irq().

There is an "if (hypervisor == Hyper-V)" check before the ASPT table
detection. Here is the reasoning behind that: according to AMD
specifications, the *same* PSP may be discoverable both through the
ASPT and on the PCI bus. In that case, if the ASPT is to be used, the
OS is supposed to disable the "PCI interface" through the "ACPI
CmdResp" register, which results in no PCI-MSI interrupts, BAR writes
being ignored, and BAR reads returning all 0xF's. I can't verify
whether that would work correctly, so in the interest of not breaking
other users, the ASPT handling is hidden behind the hypervisor check.
There is nothing Hyper-V specific about any of this code: it supports a
hardware interface present in server-grade hardware and would work on
physical hardware when (not if) someone removes the condition.

That's all there is to it.

All the other information I gave is background information that I hoped
would help better understand the setting. The most relevant piece of
information is the one that I came across last. You asked "what makes
this PSP device special". The PSP is the root-of-trust on the system:
it controls memory encryption keys and it can encrypt/decrypt
individual memory pages. SEV-SNP ties together a lot of system
components and requires enabling support for it in the AMD IOMMU too,
which is presumably why the PSP gets the same special treatment as the
AMD IOMMU. The ASPT and the AMD PSP interrupt configuration through the
"ACPI CmdResp" register are based on a similar design for the AMD
IOMMU. The AMD IOMMU is:
- discovered through the presence of the IVRS ACPI table;
- the MMIO address of the IOMMU is parsed out of the IVRS table;
- if x2APIC support is enabled, the IOMMU interrupts are delivered
  based on programming APIC-ID+vector+destination-mode into an
  interrupt control register in IOMMU MMIO space. This causes any
  PCI-MSI configuration present for the IOMMU to be ignored;
- Linux supports and uses that interrupt delivery mechanism. It is
  implemented as an irq_domain.

Do you think it makes sense to include parts of the above description
in the cover letter/commit message?

Thanks,
Jeremi
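To picture the programming model being described: one MMIO register
write carries everything needed for interrupt delivery. In the sketch
below, only the set of fields (APIC ID, vector, destination mode,
message type) comes from the description above; the bit packing is
invented, since the real encoding is defined in AMD's ASPT
specification:

#include <linux/io.h>
#include <linux/kernel.h>

/*
 * Invented packing, for illustration only: vector in the low byte,
 * mode bits above it, full 32-bit APIC ID in the high word. The real
 * layout of the "ACPI CmdResp" register is in the ASPT spec.
 */
static void psp_set_intr_sketch(void __iomem *cmdresp, u32 apic_id,
				u8 vector, bool logical, bool lowest_prio)
{
	u64 val;

	val  = vector;
	val |= (logical ? 1ULL : 0) << 8;	/* destination mode */
	val |= (lowest_prio ? 1ULL : 0) << 9;	/* message type */
	val |= (u64)apic_id << 32;

	/* The irq_domain's activate callback would end up doing this. */
	writel(lower_32_bits(val), cmdresp);
	writel(upper_32_bits(val), cmdresp + 4);
}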
On 4/3/2023 8:20 AM, Thomas Gleixner wrote:
> On Sun, Apr 02 2023 at 17:44, Borislav Petkov wrote:
>> On Fri, Mar 24, 2023 at 06:10:09PM +0100, Jeremi Piotrowski wrote:
>>> Since the AMD PSP is a privileged device, there is a desire to not have
>>> to trust the ACPI stack,
>>
>> And yet you do:
>>
>> + err = acpi_parse_aspt(&res[0], &pdata);
>> + if (err)
>> +         return err;
>>
>> You don't trust the ACPI stack, and yet you're parsing an ACPI table?!?!
>> You have to make up your mind here.

I gave you background on why Microsoft system designers like to use the
ASPT on *physical hardware* in our datacenters. It is because it allows
them to set up a highly privileged system component through an isolated
ACPI table, without needing to depend on the *rest of the ACPI stack*
(other ACPI tables and/or the ACPI interpreter). The same reason they
use the IVRS for the AMD IOMMU. I thought it might be good to write
this down, as it shows that the ASPT is a hardware interface that has
*some* value. I don't think further discussion on this point helps us
make forward progress. We're trying to adhere to a specification for a
physical device when modeling that same device in a virtual
environment. Yes, this requires parsing an ACPI table.

>> Btw, you still haven't answered my question about doing:
>>
>> devm_request_irq(dev, 9, ..)
>>
>> where 9 is the default ACPI interrupt.
>>
>> You can have some silly table tell you what to map, or you can simply
>> map IRQ 9 and be done with it. In this second case you can *really* not
>> trust ACPI because you know which IRQ it is.

I originally thought I had answered this when I said "because we're
trying not to deviate from the hardware specification for the PSP":
interrupt configuration is part of that specification. But when I think
about what you're suggesting, I can interpret it in two ways:

1. Configure the PSP to raise the vector corresponding to ACPI IRQ 9.

   This might work, and would look similar to the first version I
   posted. I'd fetch 'struct irq_cfg' for acpi_sci_irq, write the
   corresponding APIC-ID/vector into the PSP, enable PSP interrupt
   generation and then probe the "ccp" driver so that it can call
   "devm_request_irq(9)". I assume this would also require registering
   an irq affinity notifier, much like drivers/iommu/amd/init.c did
   before commit d1adcfbb520c.

2. Deviate from the hardware specification.

   From reading the acpi code (I am not at all an expert on this), that
   "9" does not look like a static value to me, so it would require
   either:
   a) passing a GSI number in an ACPI table,
   b) defining it as being the same interrupt as the SCI, which comes
      from the FADT table, or
   c) using the GPE mechanism of the ACPI SCI interrupt.

   So I'd need to define a third way for the PSP to interrupt the OS,
   one that would only be supported on Hyper-V; work with our
   hypervisor and/or virtual firmware teams to make sure that the PSP
   model supports generating the interrupt in this way; and work with
   the Windows team to make Windows support it (the same virtual
   hardware model/virtual firmware is used regardless of the OS).

I have no objection to doing "1." if it works, though I don't see it as
a big win over using an irq_domain. I don't think "2." is a reasonable
thing to ask. We do regularly make suggestions to the
hypervisor/firmware teams on how to make things better supported in
Linux without requiring hacks, but modeling a piece of hardware in a
custom way to avoid following hardware specs is questionable.
I also think that soon, when other people deploy more SEV-SNP hardware
in their datacenters, they will want to rely on the ASPT for the
reasons listed at the top of this email, so we'll be adding support for
it anyway.

Which way do you suggest we go, Boris? I'm not attached to the code at
all, but I am attached to adhering to hardware specifications. I can
try to do "1." or stick with the irq_domain approach that I posted.

Thanks,
Jeremi
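Option "1." would presumably look something like the sketch below.
Treat it as pseudocode: irq_cfg() is an x86-internal helper
(arch/x86/kernel/apic/vector.c), not a driver-facing API, and
psp_set_intr_sketch() is the invented register-write helper from the
earlier sketch:

#include <linux/acpi.h>

static int psp_ride_on_sci(void __iomem *cmdresp)
{
	struct irq_cfg *cfg;

	if (!acpi_sci_irq_valid())
		return -ENODEV;

	/* Look up the vector/APIC-ID the SCI is currently routed to. */
	cfg = irq_cfg(acpi_sci_irq);
	if (!cfg)
		return -EINVAL;

	/* Program the PSP to raise the same vector at the same CPU. */
	psp_set_intr_sketch(cmdresp, cfg->dest_apicid, cfg->vector,
			    false, false);

	/*
	 * The ccp driver would then devm_request_irq() acpi_sci_irq
	 * with IRQF_SHARED and share the line with the SCI, plus an
	 * affinity notifier to track SCI migration.
	 */
	return 0;
}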
On 4/2/2023 5:44 PM, Borislav Petkov wrote:
> On Fri, Mar 24, 2023 at 06:10:09PM +0100, Jeremi Piotrowski wrote:
>> Since the AMD PSP is a privileged device, there is a desire to not have
>> to trust the ACPI stack,
>
> [...]

Sorry, I broke threading. I meant to post this email:

https://lore.kernel.org/lkml/35f6b321-1668-2b62-cb47-3f3760be2e1d@linux.microsoft.com/#t

as a reply to *this* one.

Jeremi
On 4/5/2023 9:56 AM, Jeremi Piotrowski wrote:
> [full quote of the design write-up above trimmed]

Hi Thomas,

Have you had a chance to review this?

Thanks,
Jeremi
Jeremi!

On Wed, Apr 05 2023 at 09:56, Jeremi Piotrowski wrote:
> On 4/3/2023 8:20 AM, Thomas Gleixner wrote:

First of all, thanks for writing this up!

> The goal of these patches is to operate all the hardware interfaces
> required to run AMD SEV-SNP VMs, but in the context of a Linux VM
> running on top of Hyper-V.
>
> Here are the design constraints:
>
> 1. The interfaces exposed to the SNP-host VM to operate SEV-SNP match
>    real hardware interface specifications defined by AMD. This is
>    because we are emulating/virtualizing a hardware feature, not some
>    made-up virtual thing.

Hardware/firmware folks design a lot of interfaces which are not well
thought out, and the kernel has refused to implement support for those
in the past. It's part of our development and review process to
understand the rationale behind these interfaces and, if they do not
make sense, to tell the vendor to fix them before we set them into
stone and have to support them forever.

And this interface _is_ fixable because it's a firmware interface and
not something cast in silicon. Firmware interfaces are versioned, and
Linux has enough examples of not supporting early versions of such
interfaces.

I'm not saying it's wrong, but the lack of rationale makes me cautious.

> 2. The SNP-host VM may run either Windows (Hyper-V) or Linux, so the
>    SEV-SNP interfaces need to be supported by both.
>
> 3. Hyper-V Generation 2 VMs do not have a PCI bus. The SNP-host VM
>    must be a Hyper-V Gen 2 VM.

I wonder how that correlates with the patch series which adds PCI
pass-through support to Hyper-V Confidential VMs:

https://lore.kernel.org/lkml/1679838727-87310-1-git-send-email-mikelley@microsoft.com

But that's just me being confused about a gazillion Hyper-V related
patch series which all fiddle with something in the name of
confidential computing. It's also not really relevant to the problem at
hand.

> One of the components needed to operate SEV-SNP is the Platform
> Security Processor (PSP), aka AMD Secure Processor (ASP). The PSP is
> the root-of-trust on AMD systems. The PSP is specified as being
> discoverable either on the PCI bus, or through the presence of an ACPI
> table with the "ASPT" (AMD Secure Processor Table) signature.
>
> The ASPT provides the following information: the memory range of the
> PSP registers and the offsets of the individual PSP registers inside
> that memory range. There are 7 registers:
> - 6 are related to the "command submission" portion of the PSP; the
>   ccp module knows how to operate those.
> - the last one, the "ACPI CmdResp" register, is used to configure the
>   PSP interrupt to the OS.
> The PSP interrupt configuration through the "ACPI CmdResp" register
> takes the following information:
> - APIC ID
> - interrupt vector
> - destination mode (physical/logical)
> - message type (fixed/lowest priority)

This part is exactly where I started questioning, as it requires
providing the exact data which can be written into the X2APIC ICR MSR.
That is not necessarily the most brilliant abstraction, and it evades
interrupt remapping completely on bare metal.

> There is nothing Hyper-V specific about any of this code: it supports a
> hardware interface present in server-grade hardware and would work on
> physical hardware when (not if) someone removes the condition.

This is _not_ a hardware interface, it's a firmware interface. The
memory window is just the transport so that the OS side can talk to the
interface provided by the PSP firmware. An interface whose
specification had never seen the scrutiny of kernel developers and
maintainers before you started posting these patches.

The ASPT documentation, which I saw for the first time when you
provided the link, describes that interface but is completely devoid of
any rationale. That's not your fault, of course.

> You asked "what makes this PSP device special". The PSP is the
> root-of-trust on the system: it controls memory encryption keys and it
> can encrypt/decrypt individual memory pages.

I'm well aware what the PSP is. My question was: why does it need
special treatment for interrupts?

> SEV-SNP ties together a lot of system components and requires enabling
> support for it in the AMD IOMMU too, which is presumably why the PSP
> gets the same special treatment as the AMD IOMMU.

That's a fallacy. The PSP, when exposed via PCI, is not treated
specially. It's just assigned a regular MSI message, which is composed
by the IOMMU interrupt remapping unit's irqdomain.

That PCI device is not a real PCI device. It's a PCI shim which is
enumerated via PCI and provides the usual config space and BARs, but
the back end is not what we assume if we read PCI. That's true for a
lot of integrated devices on x86 (all vendors), and it's that way
because PCI is a very convenient and (most of the time) consistent way
of enumeration and configuration.

This ASPT/PSP mechanism just creates a different form of enumeration
and works completely independently of PCI. The table provides the
physical base address of the memory window and the register offsets in
that window. That's not really much different from PCI, which provides
the window base address in a PCI BAR and has hard-coded, device-ID
dependent register offsets.

What's actually different is how the PSP interrupt is configured in the
non-PCI case, because it obviously can't use PCI/MSI[-X]. But it could
still utilize the generic concept of MSI, in theory.

> The ASPT and the AMD PSP interrupt configuration through the "ACPI
> CmdResp" register are based on a similar design for the AMD IOMMU.

Sorry, no. Just because X does something does not mean that Y, which
wants to do something similar, is based on the same design.

> The AMD IOMMU is:
> - discovered through the presence of the IVRS ACPI table;
> - the MMIO address of the IOMMU is parsed out of the IVRS table;
> - if x2APIC support is enabled, the IOMMU interrupts are delivered
>   based on programming APIC-ID+vector+destination-mode into an
>   interrupt control register in IOMMU MMIO space. This causes any
>   PCI-MSI configuration present for the IOMMU to be ignored;
> - Linux supports and uses that interrupt delivery mechanism. It is
>   implemented as an irq_domain.

That's not entirely correct.

- Interrupt remapping requires x2APIC support.
- If interrupt remapping is enabled, then the interrupts of the IOMMU
  and remapping unit, which deliver faults and errors, cannot go
  through the remapping unit itself, for obvious reasons. So it has to
  have a mechanism which allows it to deliver an interrupt to a
  particular destination directly, without going through its possibly
  faulty self.

Obviously Linux supports that mechanism, otherwise there would be no
interrupt remapping support on Linux at all.

There is a very concise technical reason for this mechanism, but the
IOMMU and the PSP are technically completely different entities, and
the existence of the IOMMU mechanism does not make an argument at all
that the PSP firmware device is modeled the same way and needs to be
treated the same way.

The PSP does not have the same requirement as the IOMMU. Otherwise it
could not work at all with the PCI interface, which sends its interrupt
message through the interrupt remapping unit, unless the translation
mechanism does too-ugly-to-envision nasties.

The direct firmware interface, which is just based on an ioremap'ed
address window instead of a shim PCI device, suddenly requires a
different way to configure the interrupts: it requires providing a full
extended APIC-ID, the vector and the control bits, ready for
consumption to write into the x2APIC ICR MSR. Which in turn, when
running on bare metal, evades the interrupt remapping unit completely.

So it is _not_ the same thing as the PCI variant, which handles the
bog-standard remapped MSI message format just fine. It would also
handle the non-remapped format just fine _if_ IOMMU & interrupt
remapping were not mandatory for SEV[-SNP].

There might be a concise technical reason why the direct firmware
interface can't use a regular MSI message, requires the plain x2APIC
ICR data, and why it's required that the direct firmware interface can
evade interrupt remapping on bare metal, but so far nobody has provided
one.

Jeremi, I'm not asking you to provide that. There are enough AMD people
on Cc who are in a better place to answer that question. It's their
specification and their firmware, after all.

Though, while I wrote all of this up, I actually found a technical
reason: the shim PCI device obviously has a device ID, aka PCI BDF
(Bus, Device, Function), which allows the IOMMU/remapping code to find
the associated IOMMU and remapping table. The table is associated with
the device so that the remapping unit can validate whether a particular
interrupt message originated from the device associated with it.

The non-PCI variant does not have a device ID. That could probably be
solved, like it is solved for IOAPIC and HPET, but that requires at
least software support for IOMMU/remapping and might even require a
change in hardware, as far as my limited understanding goes. Whether
that's worth it is a completely different question.

As a consequence, this variant as of today cannot send interrupts
through the generic MSI mechanism, which is routed through the
IOMMU/remapping unit, and the only remaining option is to issue
interrupts directly via the ICR, aka as an IPI.

This makes a _concise_ technical argument for the interface provided,
under the following assumptions:

 - It's not worth it to address that device ID problem, because there
   is no real value, as the PSP device is considered to be "correct".

 - I'm not completely off track with my analysis.

Let's assume those assumptions hold. Still, the existence of the IOMMU
mechanism does not make an argument for the PSP case on its own. Those
two are two completely different reasons.
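For reference, the "full APIC ID + vector + control bits, ready for the
ICR" point maps directly onto the architectural x2APIC ICR layout; a
small illustration of why the ASPT fields are exactly the ICR fields:

#include <asm/msr.h>

#define X2APIC_MSR_ICR		0x830	/* architectural x2APIC ICR MSR */
#define ICR_VECTOR(v)		((u64)(v) & 0xff)	/* bits 7:0 */
#define ICR_DELIVERY_FIXED	(0ULL << 8)	/* "message type", bits 10:8 */
#define ICR_DELIVERY_LOWPRI	(1ULL << 8)	/* the other message type */
#define ICR_DEST_PHYSICAL	(0ULL << 11)	/* destination mode */
#define ICR_DEST_LOGICAL	(1ULL << 11)
#define ICR_DEST_ID(id)		((u64)(id) << 32)	/* full 32-bit APIC ID */

static void x2apic_send_fixed_ipi(u32 apic_id, u8 vector)
{
	/* A single MSR write; no IOMMU/remapping unit in the path. */
	wrmsrl(X2APIC_MSR_ICR, ICR_DEST_ID(apic_id) |
			       ICR_DEST_PHYSICAL |
			       ICR_DELIVERY_FIXED |
			       ICR_VECTOR(vector));
}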
The consequence that both need a special irqdomain is the same, but
that's it.

See? It's a sad state of affairs that I had to decipher that myself,
instead of AMD folks providing this information in the documentation
upfront, or at least having the courtesy of providing it in the context
of this discussion. That would have spared a lot of wasted time.

But why do I complain? The concept of proper hardware/software
co-design, which was postulated at least 40 years ago, is still either
unknown or in its infancy at the vast majority of silicon vendors,
including my own employer. The main concept is still to throw
hardware/firmware over the fence and let software folks deal with it.
That's a complete disaster and paves the way to death by complexity and
unmaintainability.

As a consequence, the only way for a responsible kernel maintainer is
to question the design at the point where patches are posted. Therefore
it's not unreasonable to ask for a rationale and concise technical
arguments at that point. If the provided information does not make
sense and the interface can still be adjusted, as is the case with pure
firmware interfaces, then there is no justification for hand-wavy
arguments based on presumptions and assumptions, really.

Again, I'm not blaming Jeremi, who has the same problem, just on the
other side of the fence. First he has to make it work based on some
meager documentation, and then he has to argue himself blue based on
that same meager documentation.

Can silicon folks finally get their act together and accept the fact
that the upstream Linux kernel is not there to cater to their technical
brain fart of the day? It's the other way around. Silicon vendors rely
on first-class support by the kernel, so it's their obligation to:

 - integrate upstream into their specification process _upfront_,

 - provide concise technical documentation,

 - take responsibility for the kernel as a whole.

IOW, the Linux kernel community has to be considered their primary
"customer", simply because _most_ of their actually paying, important
customers depend on that. Offloading this after the fact to paying
customers who want to enable some new feature, whether it's well
thought out or not, is really not the way to go. It's just wasting the
time of _everyone_ involved, except for those vendor-associated folks
who stand by and ignore or silently watch the discussions others have
to fight on their behalf.

Thanks,

        tglx