Message ID | 20230907181459.18145-3-ankita@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: arm64: support write combining and cachable IO memory in VMs |
+ Lorenzo On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Linux allows device drivers to map IO memory on a per-page basis using > "write combining" or WC. This is often done using > pgprot_writecombing(). The driver knows which pages can support WC > access and the proper programming model to generate this IO. Generally > the use case is to boost performance by using write combining to > generate larger PCIe MemWr TLPs. > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > all IO memory. This puts the VM in charge of the memory attributes, > and removes the KVM override to DEVICE_nGnRE. > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > allows drivers like mlx5 to fully operate their HW. > > After some discussions with ARM and CPU architects we reached the > conclusion there was no need for KVM to prevent the VM from selecting > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > that NORMAL_NC could result in uncontained failures, but upon deeper > analysis it turns out there are already possible cases for uncontained > failures with DEVICE types too. Ultimately the platform must be > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > accesses have no uncontained failures. > > Fortunately real platforms do tend to implement this. > > This patch makes the VM's memory attributes behave as follows: > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf > for details. > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> From the discussions with the hardware people in Arm and Nvidia, we indeed concluded that relaxing S2 to Normal-NC is not any worse than Device (and actually Device memory is more prone to generating uncontained errors, something to do with the Device memory ordering semantics). For this change: Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
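For context, the write-combining path referred to in the patch description above is typically reached as in the rough sketch below: ioremap_wc() for a kernel mapping, or pgprot_writecombine() in a driver's mmap handler, both of which resolve to Normal-NC on arm64. The function and device names here are hypothetical; this is background illustration, not code from the series.

```c
/*
 * Illustrative only: how a driver asks for write-combined MMIO mappings.
 * map_doorbells()/doorbell_mmap() are made-up names for this sketch.
 */
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/pci.h>

/* Kernel mapping of a prefetchable BAR: Normal-NC on arm64, so stores
 * may be gathered into larger PCIe MemWr TLPs. */
static void __iomem *map_doorbells(struct pci_dev *pdev, int bar)
{
	return ioremap_wc(pci_resource_start(pdev, bar),
			  pci_resource_len(pdev, bar));
}

/* Userspace mapping via a driver mmap handler, requesting the same
 * attribute with pgprot_writecombine(). */
static int doorbell_mmap(struct vm_area_struct *vma, unsigned long pfn)
{
	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
	return io_remap_pfn_range(vma, vma->vm_start, pfn,
				  vma->vm_end - vma->vm_start,
				  vma->vm_page_prot);
}
```

The point of the series is that, inside a guest, the second half of this sketch currently has no effect: KVM's stage-2 DEVICE_nGnRE mapping overrides the guest's Normal-NC stage-1 attribute.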
On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Linux allows device drivers to map IO memory on a per-page basis using > "write combining" or WC. This is often done using > pgprot_writecombing(). The driver knows which pages can support WC pgprot_writecombine() ? > access and the proper programming model to generate this IO. Generally > the use case is to boost performance by using write combining to > generate larger PCIe MemWr TLPs. First off, this changeset does not affect *only* Linux guests, obviously. I understand that's the use case you are after but this change is targeting all VMs, it must be clear. Then WC and mapping to PCI TLPs, either you describe that in details (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or you don't describe it at all, as it stands I don't know how to use this information. > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > all IO memory. This puts the VM in charge of the memory attributes, > and removes the KVM override to DEVICE_nGnRE. > > Ultimately this makes pgprot_writecombing() work correctly in VMs and pgprot_writecombine() ? > allows drivers like mlx5 to fully operate their HW. > > After some discussions with ARM and CPU architects we reached the > conclusion there was no need for KVM to prevent the VM from selecting > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > that NORMAL_NC could result in uncontained failures, but upon deeper > analysis it turns out there are already possible cases for uncontained > failures with DEVICE types too. Ultimately the platform must be > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > accesses have no uncontained failures. I would reorder/rephrase this changelog as follows: - Describe what the problem is (ie KVM default s2 mappings) - Describe how you are solving it - Add a link to the documentation that states why it is safe to do that and the resulting s1/s2 mappings combination It must be clear why from a legacy standpoint this is a safe change to apply. > Fortunately real platforms do tend to implement this. Remove this sentence, it adds no information for someone who is chasing bugs or just wants to understand the change itself. > This patch makes the VM's memory attributes behave as follows: "The resulting stage 1 and stage 2 memory types and cacheability attributes are as follows": > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf > for details. D8.5 I assume. I commented on the log because it must be easy to follow, it is an important change (and there is a lot of background information required to understand it - please report it). The patch itself must be reviewed by arm64/kvm maintainers; the changes (given the background work that led to them) seem fine to me (please CC me on next versions). 
Thanks Lorenzo > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > arch/arm64/include/asm/memory.h | 2 ++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h > index fde4186cc387..c247e5f29d5a 100644 > --- a/arch/arm64/include/asm/memory.h > +++ b/arch/arm64/include/asm/memory.h > @@ -147,6 +147,7 @@ > * Memory types for Stage-2 translation > */ > #define MT_S2_NORMAL 0xf > +#define MT_S2_NORMAL_NC 0x5 > #define MT_S2_DEVICE_nGnRE 0x1 > > /* > @@ -154,6 +155,7 @@ > * Stage-2 enforces Normal-WB and Device-nGnRE > */ > #define MT_S2_FWB_NORMAL 6 > +#define MT_S2_FWB_NORMAL_NC 5 > #define MT_S2_FWB_DEVICE_nGnRE 1 > > #ifdef CONFIG_ARM64_4K_PAGES > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index ccd291b6893d..a80949002191 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p > kvm_pte_t *ptep) > { > bool device = prot & KVM_PGTABLE_PROT_DEVICE; > - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : > + kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) : > KVM_S2_MEMATTR(pgt, NORMAL); > u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; > > -- > 2.17.1 >
On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote: > On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > > > Linux allows device drivers to map IO memory on a per-page basis using > > "write combining" or WC. This is often done using > > pgprot_writecombing(). The driver knows which pages can support WC > > pgprot_writecombine() ? > > > access and the proper programming model to generate this IO. Generally > > the use case is to boost performance by using write combining to > > generate larger PCIe MemWr TLPs. > > First off, this changeset does not affect *only* Linux guests, obviously. I think everyone understands that. It can be clarified. > I understand that's the use case you are after but this change is > targeting all VMs, it must be clear. > > Then WC and mapping to PCI TLPs, either you describe that in details > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or > you don't describe it at all, as it stands I don't know how to use > this information. How about another pargraph: KVM prevents all VMs (including Linux) from accessing NORMAL_NC mappings, which is how Linux implements pgprot_writecombine(). This prevents using this performance optimization within VMs. I don't think we need to go into details how it works beyond that it requires NORMAL_NC. > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > > all IO memory. This puts the VM in charge of the memory attributes, > > and removes the KVM override to DEVICE_nGnRE. > > > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > > pgprot_writecombine() ? > > > allows drivers like mlx5 to fully operate their HW. > > > > After some discussions with ARM and CPU architects we reached the > > conclusion there was no need for KVM to prevent the VM from selecting > > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > > that NORMAL_NC could result in uncontained failures, but upon deeper > > analysis it turns out there are already possible cases for uncontained > > failures with DEVICE types too. Ultimately the platform must be > > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > > accesses have no uncontained failures. > > I would reorder/rephrase this changelog as follows: > > - Describe what the problem is (ie KVM default s2 mappings) The problem is that pgprot_writecombine() doesn't work in Linux VMs. That is the first pagraph. > - Describe how you are solving it That is the middle paragraph "Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis" > - Add a link to the documentation that states why it is safe to do > that and the resulting s1/s2 mappings combination AFAIK there is no documentation beyond the combining rules. Exactly what should happen in various error conditions is implementation defined. Catalin did you ever find anything? > It must be clear why from a legacy standpoint this is a safe change > to apply. This is why: > > Fortunately real platforms do tend to implement this. It is why it is safe today, because real platforms don't throw uncontained errors from typical PCI accesses that VFIO allows. I think the conclusions was it turns out that is just because they don't do errors at all, not because DEVICE_* prevents it. > Remove this sentence, it adds no information for someone who > is chasing bugs or just wants to understand the change itself. 
So, if you hit a bug here you might evaluate if there is something wrong with your platform, ie it is allowing uncontained errors in unexpected places. Jason
On Mon, Sep 11, 2023 at 02:20:01PM -0300, Jason Gunthorpe wrote: > On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote: > > On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > > > From: Ankit Agrawal <ankita@nvidia.com> > > > > > > Linux allows device drivers to map IO memory on a per-page basis using > > > "write combining" or WC. This is often done using > > > pgprot_writecombing(). The driver knows which pages can support WC > > > > pgprot_writecombine() ? > > > > > access and the proper programming model to generate this IO. Generally > > > the use case is to boost performance by using write combining to > > > generate larger PCIe MemWr TLPs. > > > > First off, this changeset does not affect *only* Linux guests, obviously. > > I think everyone understands that. It can be clarified. OK, this is not a Linux guest fix *only*, everyone understands that but I would rather make sure that's crystal clear. > > I understand that's the use case you are after but this change is > > targeting all VMs, it must be clear. > > > > Then WC and mapping to PCI TLPs, either you describe that in details > > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or > > you don't describe it at all, as it stands I don't know how to use > > this information. > > How about another pargraph: > > KVM prevents all VMs (including Linux) from accessing NORMAL_NC > mappings, which is how Linux implements pgprot_writecombine(). This > prevents using this performance optimization within VMs. > > I don't think we need to go into details how it works beyond that it > requires NORMAL_NC. I think it is worth summarizing the discussion that led to this change, I can write something up. > > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > > > all IO memory. This puts the VM in charge of the memory attributes, > > > and removes the KVM override to DEVICE_nGnRE. > > > > > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > > > > pgprot_writecombine() ? > > > > > allows drivers like mlx5 to fully operate their HW. > > > > > > After some discussions with ARM and CPU architects we reached the > > > conclusion there was no need for KVM to prevent the VM from selecting > > > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > > > that NORMAL_NC could result in uncontained failures, but upon deeper > > > analysis it turns out there are already possible cases for uncontained > > > failures with DEVICE types too. Ultimately the platform must be > > > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > > > accesses have no uncontained failures. > > > > I would reorder/rephrase this changelog as follows: > > > > - Describe what the problem is (ie KVM default s2 mappings) > > The problem is that pgprot_writecombine() doesn't work in Linux > VMs. That is the first pagraph. s/pagraph/paragraph Well to me that's how the problem was spotted but OK, I can rewrite the log and post it here, NP. > > - Describe how you are solving it > > That is the middle paragraph "Allow VMs to select DEVICE_* or > NORMAL_NC on a page by page basis" > > > - Add a link to the documentation that states why it is safe to do > > that and the resulting s1/s2 mappings combination > > AFAIK there is no documentation beyond the combining rules. Exactly > what should happen in various error conditions is implementation > defined. Catalin did you ever find anything? 
> > > It must be clear why from a legacy standpoint this is a safe change > > to apply. > > This is why: > > > > Fortunately real platforms do tend to implement this. Is it possible please to describe what "this" is in details ? What does "real platforms" mean ? Let's describe what HW/SW should be implemented to make this safe. > It is why it is safe today, because real platforms don't throw "real platforms", I have a problem with that, see above. > uncontained errors from typical PCI accesses that VFIO allows. I think > the conclusions was it turns out that is just because they don't do > errors at all, not because DEVICE_* prevents it. > > > Remove this sentence, it adds no information for someone who > > is chasing bugs or just wants to understand the change itself. > > So, if you hit a bug here you might evaluate if there is something > wrong with your platform, ie it is allowing uncontained errors in > unexpected places. Developers looking at commit log in the future must be able to understand why it was a sound change to make. As it stands IMO the commit log does not explain it fully. I can write up the commit log and post it if I manage to summarize it any better - more important the review on the code (that was already provided), I will try to write something up asap. Thanks, Lorenzo
On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote: > > > I understand that's the use case you are after but this change is > > > targeting all VMs, it must be clear. > > > > > > Then WC and mapping to PCI TLPs, either you describe that in details > > > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or > > > you don't describe it at all, as it stands I don't know how to use > > > this information. > > > > How about another pargraph: > > > > KVM prevents all VMs (including Linux) from accessing NORMAL_NC > > mappings, which is how Linux implements pgprot_writecombine(). This > > prevents using this performance optimization within VMs. > > > > I don't think we need to go into details how it works beyond that it > > requires NORMAL_NC. > > I think it is worth summarizing the discussion that led to this change, > I can write something up. We tried here, in short the reason we all understood it was like this today is because there was a belief that DEVICE_nGnRE allowed fewer/none uncontained failures than NORMAL_NC. AFAICT it turns out the specs are unclear and actual real world IP from ARM does not behave this way. We learned that at best they are equally unsafe, even perhaps NORMAL_NC is a bit safer. What makes it safe in real chips is not the unclear specification, or the behavior of the ARM IP to generate uncontained failures, but because the combination of all IP in the platform never triggers an uncontained error anyhow. For instance PCIe integration IP does not transform PCIe TLP failures into uncontained errors. They generate all ones data on the bus instead. There is no spec that says this should happen, it is just what people do - c&p from Intel. On top of that we see that server focused designs built to run VMs have put in the RAS work to ensure uncontained errors can't exist and truly plug this hole. Thus, at the end of the day the safety of KVM and VFIO is effectively an unknowable platform specific property no matter what choice we make in SW here. Thus let's make the choice that enables the broadest VM functionality. > > > > Fortunately real platforms do tend to implement this. > > Is it possible please to describe what "this" is in details ? "prevent uncontained errors for DEVICE_* and NORMAL_NC access" > What does "real platforms" mean ? Actual real world deployments of VFIO and KVM in production that expect to be safe from uncontained errors. > Let's describe what HW/SW should be implemented to make this > safe. The platform must not generate uncontained errors :) > I can write up the commit log and post it if I manage to summarize > it any better - more important the review on the code (that was already > provided), I will try to write something up asap. Thank you! Jason
On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote: [...] > > I can write up the commit log and post it if I manage to summarize > > it any better - more important the review on the code (that was already > > provided), I will try to write something up asap. > > Thank you! > > Jason FWIW, I have come up with the commit log below - please review and scrutinize/change it as deemed fit - it is not necessarily clearer than this one and it definitely requires MarcZ/Catalin/Will attention before it can be considered: --- Currently, KVM for ARM64 maps at stage 2 memory that is considered device (ie using pfn_is_map_memory() to discern between device memory and memory itself) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system can't determine device MMIO mapping memory attributes on its own but it is always overriden by the KVM stage 2 default. This set-up does not allow guest operating systems to map device memory on a page by page basis with combined attributes other than DEVICE_nGnRE, which turns out to be an issue in that guest operating systems (eg Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (eg gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (eg unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (ie it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be imprecise. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings (ie DEVICE_nGnRE memory type is even more problematic than NormalNC in terms of containability since eg aborts triggered on loads cannot be made synchronous, which make them harder to contain); this means that, regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures; reworded, the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to NormalNC. This puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote: > On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote: > > On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote: > > [...] > > > > I can write up the commit log and post it if I manage to summarize > > > it any better - more important the review on the code (that was already > > > provided), I will try to write something up asap. > > > > Thank you! > > > > Jason > > FWIW, I have come up with the commit log below - please review and > scrutinize/change it as deemed fit - it is not necessarily clearer > than this one and it definitely requires MarcZ/Catalin/Will attention > before it can be considered: I have no issue with this language. Thanks, Jason
On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote: > Currently, KVM for ARM64 maps at stage 2 memory that is > considered device (ie using pfn_is_map_memory() to discern > between device memory and memory itself) with DEVICE_nGnRE > memory attributes; this setting overrides (as per the ARM > architecture [1]) any device MMIO mapping present at stage > 1, resulting in a set-up whereby a guest operating system > can't determine device MMIO mapping memory attributes on its > own but it is always overriden by the KVM stage 2 default. > > This set-up does not allow guest operating systems to map > device memory on a page by page basis with combined attributes > other than DEVICE_nGnRE, Well, it also has the option of DEVICE_nGnRnE ;). > which turns out to be an issue in that > guest operating systems (eg Linux) may request to map > devices MMIO regions with memory attributes that guarantee > better performance (eg gathering attribute - that for some > devices can generate larger PCIe memory writes TLPs) > and specific operations (eg unaligned transactions) such as > the NormalNC memory type. > > The default device stage 2 mapping was chosen in KVM > for ARM64 since it was considered safer (ie it would > not allow guests to trigger uncontained failures > ultimately crashing the machine) but this turned out > to be imprecise. > > Failures containability is a property of the platform > and is independent from the memory type used for MMIO > device memory mappings (ie DEVICE_nGnRE memory type is > even more problematic than NormalNC in terms of containability > since eg aborts triggered on loads cannot be made synchronous, > which make them harder to contain); this means that, > regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures; reworded, the default KVM device > stage 2 memory attributes play no role in making device > assignment safer for a given platform and therefore can > be relaxed. > > For all these reasons, relax the KVM stage 2 device > memory attributes from DEVICE_nGnRE to NormalNC. > > This puts guests in control (thanks to stage1+stage2 > combined memory attributes rules [1]) of device MMIO > regions memory mappings, according to the rules > described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]): > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> Not sure what's wrong with my font setup as I can't see the above table but I know it from the Arm ARM. Anyway, the text looks fine to me. Thanks for putting it together Lorenzo. One thing not mentioned here is that pci-vfio still maps such memory as Device-nGnRnE in user space and relaxing this potentially creates an alias. But such alias is only relevant of both the VMM and the VM try to access the same device which I doubt is a realistic scenario.
On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote: > > Currently, KVM for ARM64 maps at stage 2 memory that is > > considered device (ie using pfn_is_map_memory() to discern > > between device memory and memory itself) with DEVICE_nGnRE > > memory attributes; this setting overrides (as per the ARM > > architecture [1]) any device MMIO mapping present at stage > > 1, resulting in a set-up whereby a guest operating system > > can't determine device MMIO mapping memory attributes on its > > own but it is always overriden by the KVM stage 2 default. > > > > This set-up does not allow guest operating systems to map > > device memory on a page by page basis with combined attributes > > other than DEVICE_nGnRE, > > Well, it also has the option of DEVICE_nGnRnE ;). That's true - we have to fix it up. "This set-up does not allow guest operating systems to choose device memory attributes on a page by page basis independently from KVM stage 2 mappings,..." > > which turns out to be an issue in that > > guest operating systems (eg Linux) may request to map > > devices MMIO regions with memory attributes that guarantee > > better performance (eg gathering attribute - that for some > > devices can generate larger PCIe memory writes TLPs) > > and specific operations (eg unaligned transactions) such as > > the NormalNC memory type. > > > > The default device stage 2 mapping was chosen in KVM > > for ARM64 since it was considered safer (ie it would > > not allow guests to trigger uncontained failures > > ultimately crashing the machine) but this turned out > > to be imprecise. > > > > Failures containability is a property of the platform > > and is independent from the memory type used for MMIO > > device memory mappings (ie DEVICE_nGnRE memory type is > > even more problematic than NormalNC in terms of containability > > since eg aborts triggered on loads cannot be made synchronous, > > which make them harder to contain); this means that, > > regardless of the combined stage1+stage2 mappings a > > platform is safe if and only if device transactions cannot trigger > > uncontained failures; reworded, the default KVM device > > stage 2 memory attributes play no role in making device > > assignment safer for a given platform and therefore can > > be relaxed. > > > > For all these reasons, relax the KVM stage 2 device > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > This puts guests in control (thanks to stage1+stage2 > > combined memory attributes rules [1]) of device MMIO > > regions memory mappings, according to the rules > > described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]): > > > > S1 | S2 | Result > > NORMAL-WB | NORMAL-NC | NORMAL-NC > > NORMAL-WT | NORMAL-NC | NORMAL-NC > > NORMAL-NC | NORMAL-NC | NORMAL-NC > > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > Not sure what's wrong with my font setup as I can't see the above table > but I know it from the Arm ARM. > > Anyway, the text looks fine to me. Thanks for putting it together > Lorenzo. > > One thing not mentioned here is that pci-vfio still maps such memory as > Device-nGnRnE in user space and relaxing this potentially creates an > alias. But such alias is only relevant of both the VMM and the VM try to > access the same device which I doubt is a realistic scenario.
I can update the log and repost, fixing the comment above as well (need to check what's wrong with the table characters). Thanks, Lorenzo
On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: [...] > Anyway, the text looks fine to me. Thanks for putting it together > Lorenzo. Thanks ! > One thing not mentioned here is that pci-vfio still maps such memory as > Device-nGnRnE in user space and relaxing this potentially creates an > alias. But such alias is only relevant of both the VMM and the VM try to > access the same device which I doubt is a realistic scenario. A revised log, FWIW: --- Currently, KVM for ARM64 maps at stage 2 memory that is considered device (ie it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system can't determine device MMIO mapping memory attributes on its own but it is always overriden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes on a page by page basis independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (eg Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (eg gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (eg unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (ie it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be imprecise. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings (ie DEVICE_nGnRE memory type is even more problematic than NormalNC in terms of containability since eg aborts triggered on loads cannot be made synchronous, which make them harder to contain); this means that, regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures; reworded, the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to NormalNC. This puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. 
Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (ie it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
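To illustrate the stage-1 alias mentioned at the end of the log above, the VFIO-style userspace mapping it refers to looks roughly like the sketch below, with attributes derived from pgprot_noncached() (Device-nGnRnE on arm64), while the guest can now reach Normal-NC through its own stage 1 combined with the relaxed stage 2. Function and parameter names are hypothetical, not VFIO's actual code.

```c
/*
 * Sketch of a VFIO-style userspace mmap of device MMIO. Only
 * pgprot_noncached() and remap_pfn_range() are the real interfaces
 * being illustrated; the surrounding function is hypothetical.
 */
#include <linux/mm.h>
#include <linux/pci.h>

static int mmio_bar_mmap(struct vm_area_struct *vma,
			 struct pci_dev *pdev, int bar)
{
	unsigned long pfn = pci_resource_start(pdev, bar) >> PAGE_SHIFT;

	/* VMM stage-1 mapping: Device-nGnRnE on arm64. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}
```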
On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > > [...] > > > Anyway, the text looks fine to me. Thanks for putting it together > > Lorenzo. > > Thanks ! > > > One thing not mentioned here is that pci-vfio still maps such memory as > > Device-nGnRnE in user space and relaxing this potentially creates an > > alias. But such alias is only relevant of both the VMM and the VM try to > > access the same device which I doubt is a realistic scenario. > > A revised log, FWIW: What is the plan here, do you want Ankit to resend the series with this text? There were no comments on patch 1/2? Thanks, Jason
On Thu, Oct 05, 2023 at 08:56:30AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > > > > [...] > > > > > Anyway, the text looks fine to me. Thanks for putting it together > > > Lorenzo. > > > > Thanks ! > > > > > One thing not mentioned here is that pci-vfio still maps such memory as > > > Device-nGnRnE in user space and relaxing this potentially creates an > > > alias. But such alias is only relevant of both the VMM and the VM try to > > > access the same device which I doubt is a realistic scenario. > > > > A revised log, FWIW: > > What is the plan here, do you want Ankit to resend the series with > this text? I have put together the text to summarize the discussions that led to this patch; if that helps somehow, yes please, make it the commit log. > There were no comments on patch 1/2? I don't have enough background knowledge to make a statement on this series' code, I just wanted to make sure the architectural details are in the commit log so that it can help the reviewers and, if it makes it to the kernel, future developers. Thanks, Lorenzo
On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Linux allows device drivers to map IO memory on a per-page basis using > "write combining" or WC. This is often done using > pgprot_writecombing(). The driver knows which pages can support WC > access and the proper programming model to generate this IO. Generally > the use case is to boost performance by using write combining to > generate larger PCIe MemWr TLPs. > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > all IO memory. This puts the VM in charge of the memory attributes, > and removes the KVM override to DEVICE_nGnRE. > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > allows drivers like mlx5 to fully operate their HW. > > After some discussions with ARM and CPU architects we reached the > conclusion there was no need for KVM to prevent the VM from selecting > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > that NORMAL_NC could result in uncontained failures, but upon deeper > analysis it turns out there are already possible cases for uncontained > failures with DEVICE types too. Ultimately the platform must be > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > accesses have no uncontained failures. > > Fortunately real platforms do tend to implement this. > > This patch makes the VM's memory attributes behave as follows: > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf > for details. > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > arch/arm64/include/asm/memory.h | 2 ++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h > index fde4186cc387..c247e5f29d5a 100644 > --- a/arch/arm64/include/asm/memory.h > +++ b/arch/arm64/include/asm/memory.h > @@ -147,6 +147,7 @@ > * Memory types for Stage-2 translation > */ > #define MT_S2_NORMAL 0xf > +#define MT_S2_NORMAL_NC 0x5 > #define MT_S2_DEVICE_nGnRE 0x1 > > /* > @@ -154,6 +155,7 @@ > * Stage-2 enforces Normal-WB and Device-nGnRE > */ > #define MT_S2_FWB_NORMAL 6 > +#define MT_S2_FWB_NORMAL_NC 5 > #define MT_S2_FWB_DEVICE_nGnRE 1 > > #ifdef CONFIG_ARM64_4K_PAGES > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index ccd291b6893d..a80949002191 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p > kvm_pte_t *ptep) > { > bool device = prot & KVM_PGTABLE_PROT_DEVICE; > - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : > + kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) : > KVM_S2_MEMATTR(pgt, NORMAL); > u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; I think this is putting the policy into the low-level page-table code, which isn't where it belongs. KVM_PGTABLE_PROT_DEVICE means that the mapping should be created with device attributes. If you want other attributes, then please introduce another entry to 'kvm_pgtable_prot' and pass that instead (e.g. KVM_PGTABLE_PROT_NC, which coincidentally we already have in Android [1] ;). Will [1] https://android.googlesource.com/kernel/common/+/72cc19df8b71095f9740ff0ca6a75bf7ed27b0cd%5E%21/
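To make the alternative above concrete, a minimal sketch of a separate Normal-NC protection flag follows; the flag name mirrors the Android change linked above, but the bit value and the helper shown are assumptions, not the code of this series or of the Android tree.

```c
/*
 * Sketch only: a dedicated stage-2 protection flag for Normal-NC rather
 * than changing what KVM_PGTABLE_PROT_DEVICE produces. The bit position
 * is an assumption for illustration.
 */
#define KVM_PGTABLE_PROT_NC	BIT(4)	/* hypothetical: map as Normal-NC */

static kvm_pte_t stage2_memattr_sketch(struct kvm_pgtable *pgt,
				       enum kvm_pgtable_prot prot)
{
	if (prot & KVM_PGTABLE_PROT_DEVICE)
		return KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
	if (prot & KVM_PGTABLE_PROT_NC)
		return KVM_S2_MEMATTR(pgt, NORMAL_NC);
	return KVM_S2_MEMATTR(pgt, NORMAL);
}
```

With this shape, callers that pass KVM_PGTABLE_PROT_DEVICE today keep Device-nGnRE, and only paths that explicitly opt in to the new flag get the relaxed attribute.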
On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > > [...] > > > Anyway, the text looks fine to me. Thanks for putting it together > > Lorenzo. > > Thanks ! > > > One thing not mentioned here is that pci-vfio still maps such memory as > > Device-nGnRnE in user space and relaxing this potentially creates an > > alias. But such alias is only relevant of both the VMM and the VM try to > > access the same device which I doubt is a realistic scenario. > > A revised log, FWIW: Thanks for putting this together, Lorenzo. Just one thing below: > --- > Currently, KVM for ARM64 maps at stage 2 memory that is > considered device (ie it is not RAM) with DEVICE_nGnRE > memory attributes; this setting overrides (as per the ARM > architecture [1]) any device MMIO mapping present at stage > 1, resulting in a set-up whereby a guest operating system > can't determine device MMIO mapping memory attributes on its > own but it is always overriden by the KVM stage 2 default. > > This set-up does not allow guest operating systems to select > device memory attributes on a page by page basis independently > from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage > 2 memory type attributes"), which turns out to be an issue in that > guest operating systems (eg Linux) may request to map > devices MMIO regions with memory attributes that guarantee > better performance (eg gathering attribute - that for some > devices can generate larger PCIe memory writes TLPs) > and specific operations (eg unaligned transactions) such as > the NormalNC memory type. > > The default device stage 2 mapping was chosen in KVM > for ARM64 since it was considered safer (ie it would > not allow guests to trigger uncontained failures > ultimately crashing the machine) but this turned out > to be imprecise. > > Failures containability is a property of the platform > and is independent from the memory type used for MMIO > device memory mappings (ie DEVICE_nGnRE memory type is > even more problematic than NormalNC in terms of containability > since eg aborts triggered on loads cannot be made synchronous, > which make them harder to contain); this means that, > regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures; reworded, the default KVM device > stage 2 memory attributes play no role in making device > assignment safer for a given platform and therefore can > be relaxed. > > For all these reasons, relax the KVM stage 2 device > memory attributes from DEVICE_nGnRE to NormalNC. The reasoning above suggests to me that this should probably just be Normal cacheable, as that is what actually allows the guest to control the attributes. So what is the rationale behind stopping at Normal-NC? Will
On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > Failures containability is a property of the platform > > and is independent from the memory type used for MMIO > > device memory mappings (ie DEVICE_nGnRE memory type is > > even more problematic than NormalNC in terms of containability > > since eg aborts triggered on loads cannot be made synchronous, > > which make them harder to contain); this means that, > > regardless of the combined stage1+stage2 mappings a > > platform is safe if and only if device transactions cannot trigger > > uncontained failures; reworded, the default KVM device > > stage 2 memory attributes play no role in making device > > assignment safer for a given platform and therefore can > > be relaxed. > > > > For all these reasons, relax the KVM stage 2 device > > memory attributes from DEVICE_nGnRE to NormalNC. > > The reasoning above suggests to me that this should probably just be > Normal cacheable, as that is what actually allows the guest to control > the attributes. So what is the rationale behind stopping at Normal-NC? I agree it would be very nice if the all the memory in the guest could just be cachable and the guest could select everything. However, I think Lorenzo over stated the argument. The off-list discussion was focused on NormalNC for MMIO only. Nobody raised the idea that cachable was safe from uncontained errors for MMIO. I'm looking through the conversations and I wouldn't jump to concluding that "cachable MMIO" is safe from uncontained failures. Catalin has already raised a number of conerns in the other patch about making actual "designed to be cachable memory" into KVM cachable. Regards, Jason
On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > For all these reasons, relax the KVM stage 2 device > > memory attributes from DEVICE_nGnRE to NormalNC. > > The reasoning above suggests to me that this should probably just be > Normal cacheable, as that is what actually allows the guest to control > the attributes. So what is the rationale behind stopping at Normal-NC? It's more like we don't have any clue on what may happen. MTE is obviously a case where it can go wrong (we can blame the architecture design here) but I recall years ago where a malicious guest could bring the platform down by mapping the GIC CPU interface as cacheable. Not sure how error containment works with cacheable memory. A cacheable access to a device may stay in the cache a lot longer after the guest has been scheduled out, only evicted at some random time. We may no longer be able to associate it with the guest, especially if the guest exited. Also not sure about claiming back the device after killing the guest, do we need cache maintenance? So, for now I'd only relax this if we know there's RAM(-like) on the other side and won't trigger some potentially uncontainable errors as a result.
On Thu, Oct 12, 2023 at 10:20:45AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > > > Failures containability is a property of the platform > > > and is independent from the memory type used for MMIO > > > device memory mappings (ie DEVICE_nGnRE memory type is > > > even more problematic than NormalNC in terms of containability > > > since eg aborts triggered on loads cannot be made synchronous, > > > which make them harder to contain); this means that, > > > regardless of the combined stage1+stage2 mappings a > > > platform is safe if and only if device transactions cannot trigger > > > uncontained failures; reworded, the default KVM device > > > stage 2 memory attributes play no role in making device > > > assignment safer for a given platform and therefore can > > > be relaxed. > > > > > > For all these reasons, relax the KVM stage 2 device > > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > The reasoning above suggests to me that this should probably just be > > Normal cacheable, as that is what actually allows the guest to control > > the attributes. So what is the rationale behind stopping at Normal-NC? > > I agree it would be very nice if the all the memory in the guest could > just be cachable and the guest could select everything. > > However, I think Lorenzo over stated the argument. > > The off-list discussion was focused on NormalNC for MMIO only. Nobody > raised the idea that cachable was safe from uncontained errors for > MMIO. True, it should be made clearer ie "independent from the non-cacheable/uncacheable memory type...", please update the log accordingly, forgive me the confusion I raised. Lorenzo > I'm looking through the conversations and I wouldn't jump to > concluding that "cachable MMIO" is safe from uncontained failures. > > Catalin has already raised a number of conerns in the other patch > about making actual "designed to be cachable memory" into KVM > cachable. > > Regards, > Jason
On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote: > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > > For all these reasons, relax the KVM stage 2 device > > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > The reasoning above suggests to me that this should probably just be > > Normal cacheable, as that is what actually allows the guest to control > > the attributes. So what is the rationale behind stopping at Normal-NC? > > It's more like we don't have any clue on what may happen. MTE is > obviously a case where it can go wrong (we can blame the architecture > design here) but I recall years ago where a malicious guest could bring > the platform down by mapping the GIC CPU interface as cacheable. ... and do we know that isn't the case for non-cacheable? If not, why not? Also, are you saying we used to map the GIC CPU interface as cacheable at stage-2? I remember exclusives causing a problem, but I don't remember the guest having a cacheable mapping. > Not sure how error containment works with cacheable memory. A cacheable > access to a device may stay in the cache a lot longer after the guest > has been scheduled out, only evicted at some random time. But similarly, non-cacheable stores can be buffered. Why isn't that a problem? > We may no longer be able to associate it with the guest, especially if the > guest exited. Also not sure about claiming back the device after killing > the guest, do we need cache maintenance? Claiming back the device also seems strange if the guest has been using non-cacheable accesses since I think you could get write merging and reordering with subsequent device accesses trying to reset the device. > So, for now I'd only relax this if we know there's RAM(-like) on the > other side and won't trigger some potentially uncontainable errors as a > result. I guess my wider point is that I'm not convinced that non-cacheable is actually much better and I think we're going way off the deep end looking at what particular implementations do and trying to justify to ourselves that non-cacheable is safe, even though it's still a normal memory type at the end of the day. Obviously, it's up to Marc and Oliver if they want to do this, but I'm wary without an official statement from Arm to say that Normal-NC is correct. There's mention of such a statement in the cover letter: > We hope ARM will publish information helping platform designers > follow these guidelines. but imo we shouldn't merge this without either: (a) _Architectural_ guidance (as opposed to some random whitepaper or half-baked certification scheme). - or - (b) A concrete justification based on the current architecture as to why Normal-NC is the right thing to do for KVM. The current wording talks about use-cases (I get this) and error containment (it's a property of the system) but doesn't talk at all about why Normal-NC is the right result. Will
On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > I guess my wider point is that I'm not convinced that non-cacheable is > actually much better and I think we're going way off the deep end looking > at what particular implementations do and trying to justify to ourselves > that non-cacheable is safe, even though it's still a normal memory type > at the end of the day. When we went over this with ARM it became fairly clear there wasn't an official statement that Device-* is safe from uncontained failures. For instance, looking at the actual IP, our architects pointed out that ARM IP already provides ways for Device-* to trigger uncontained failures today. We then mutually concluded that KVM safe implementations must already be preventing uncontained failures for Device-* at the system level and that same prevention will carry over to NormalNC as well. IMHO, this seems to be a gap where ARM has not fully defined when uncontained failures are allowed and left that as an implementation choice. In other words, KVM safety around uncontained failure is not a property that can be reasoned about from the ARM architecture alone. > The current wording talks about use-cases (I get this) and error containment > (it's a property of the system) but doesn't talk at all about why Normal-NC > is the right result. Given that Device-* and NormalNC are equally implementation defined with regards to uncontained failures, NormalNC allows more VM functionality. Further, we have a broad agreement that this use case is important, and that NormalNC is the correct way to adress it. I think you are right to ask for more formality from ARM team but also we shouldn't hold up fixing real functional bugs in real shipping server ARM products. Jason
On Thu, Oct 12, 2023 at 12:44:39PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > I guess my wider point is that I'm not convinced that non-cacheable is > > actually much better and I think we're going way off the deep end looking > > at what particular implementations do and trying to justify to ourselves > > that non-cacheable is safe, even though it's still a normal memory type > > at the end of the day. > > When we went over this with ARM it became fairly clear there wasn't an > official statement that Device-* is safe from uncontained > failures. For instance, looking at the actual IP, our architects > pointed out that ARM IP already provides ways for Device-* to trigger > uncontained failures today. > > We then mutually concluded that KVM safe implementations must already > be preventing uncontained failures for Device-* at the system level > and that same prevention will carry over to NormalNC as well. > > IMHO, this seems to be a gap where ARM has not fully defined when > uncontained failures are allowed and left that as an implementation > choice. > > In other words, KVM safety around uncontained failure is not a > property that can be reasoned about from the ARM architecture alone. > > > The current wording talks about use-cases (I get this) and error containment > > (it's a property of the system) but doesn't talk at all about why Normal-NC > > is the right result. > > Given that Device-* and NormalNC are equally implementation defined > with regards to uncontained failures, NormalNC allows more VM > functionality. > > Further, we have a broad agreement that this use case is important, > and that NormalNC is the correct way to adress it. > > I think you are right to ask for more formality from ARM team but also > we shouldn't hold up fixing real functional bugs in real shipping > server ARM products. All I'm asking for is justification as to why Normal-NC is the right memory type rather than any other normal memory type. If it's not possible to explain that architecturally, then I'm not sure this change belongs in architecture code. Ultimately, we need to be able to maintain this stuff, so we can't just blindly implement changes based on a combination of off-list discussions and individual product needs. For example, if somebody else rocks up tomorrow and asks for this to be Normal-writethrough, what grounds do we have to say no if we've taken this change already? So please let's get to a point where we can actually reason about this. Will
On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote: > > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > > > For all these reasons, relax the KVM stage 2 device > > > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > > > The reasoning above suggests to me that this should probably just be > > > Normal cacheable, as that is what actually allows the guest to control > > > the attributes. So what is the rationale behind stopping at Normal-NC? > > > > It's more like we don't have any clue on what may happen. MTE is > > obviously a case where it can go wrong (we can blame the architecture > > design here) but I recall years ago where a malicious guest could bring > > the platform down by mapping the GIC CPU interface as cacheable. > > ... and do we know that isn't the case for non-cacheable? If not, why not? Trying to get this information from the hw folk and architects is really hard. So we only relax it one step at a time ;). But given the MTE problems, I'd not go for cacheable Stage 2 unless we have FEAT_MTE_PERM implemented (both hw and sw). S2 cacheable allows the guest to map it as Normal Tagged. > Also, are you saying we used to map the GIC CPU interface as cacheable > at stage-2? I remember exclusives causing a problem, but I don't remember > the guest having a cacheable mapping. The guest never had a cacheable mapping, IIRC it was more of a theoretical problem, plugging a hole. Now, maybe I misremember, it's pretty hard to search the git logs given how the code was moved around (but I do remember the building we were in when discussing this, it was on the ground floor ;)). > > Not sure how error containment works with cacheable memory. A cacheable > > access to a device may stay in the cache a lot longer after the guest > > has been scheduled out, only evicted at some random time. > > But similarly, non-cacheable stores can be buffered. Why isn't that a > problem? RAS might track this for cacheable mappings as well, I just haven't figured out the details. > > We may no longer be able to associate it with the guest, especially if the > > guest exited. Also not sure about claiming back the device after killing > > the guest, do we need cache maintenance? > > Claiming back the device also seems strange if the guest has been using > non-cacheable accesses since I think you could get write merging and > reordering with subsequent device accesses trying to reset the device. True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > So, for now I'd only relax this if we know there's RAM(-like) on the > > other side and won't trigger some potentially uncontainable errors as a > > result. > > I guess my wider point is that I'm not convinced that non-cacheable is > actually much better and I think we're going way off the deep end looking > at what particular implementations do and trying to justify to ourselves > that non-cacheable is safe, even though it's still a normal memory type > at the end of the day. Is this about Device vs NC or Device/NC vs Normal Cacheable? The justification for the former has been summarised in Lorenzo's write-up. How the hardware behaves, it depends a lot on the RAS implementation. The BSA has some statements but not sure it covers everything. Things can go wrong but that's not because Device does anything better. 
Given the RAS implementation, external aborts caused on Device memory (e.g. wrong size access) is uncontainable. For Normal NC it can be contained (I can dig out the reasoning behind this if you want, IIUC something to do with not being able to cancel an already issued Device access since such accesses don't allow speculation due to side-effects; for Normal NC, it's just about the software not getting the data). > Obviously, it's up to Marc and Oliver if they want to do this, but I'm > wary without an official statement from Arm to say that Normal-NC is > correct. There's mention of such a statement in the cover letter: > > > We hope ARM will publish information helping platform designers > > follow these guidelines. > > but imo we shouldn't merge this without either: > > (a) _Architectural_ guidance (as opposed to some random whitepaper or > half-baked certification scheme). Well, you know the story, the architects will probably make it a SoC or integration issue, PCIe etc., not something that can live in the Arm ARM. The best we could get is more recommendations in the RAS spec around containment but not for things that might happen outside the CPU, e.g. PCIe root complex. > - or - > > (b) A concrete justification based on the current architecture as to > why Normal-NC is the right thing to do for KVM. To put it differently, we don't have any strong arguments why Device is the right thing to do. We chose Device based on some understanding software people had about how the hardware behaves, which apparently wasn't entirely correct (and summarised by Lorenzo).
On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote: > All I'm asking for is justification as to why Normal-NC is the right > memory type rather than any other normal memory type. If it's not possible > to explain that architecturally, then I'm not sure this change belongs in > architecture code. Well, I think Catalin summarized it nicely, I second his ask at the end: We are basically at your scenario below - can you justify why DEVICE_nGnRE is correct today, architecturally? We could not. Earlier someone said uncontained failure prevention, but a deep dive on that found it is not so. > Ultimately, we need to be able to maintain this stuff, so we can't just > blindly implement changes based on a combination of off-list discussions > and individual product needs. For example, if somebody else rocks up > tomorrow and asks for this to be Normal-writethrough, what grounds do > we have to say no if we've taken this change already? Hmm. Something got lost here. This patch is talking about the S2 MemAttr[3:0]. There are only 4 relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c): 0b001 - Today: force VM to be Device nGnRE 0b101 - Proposed: prevent the VM from selecting cacheable, allow it to choose Device-* or NormalNC 0b110 - Force write-back. Would totally break MMIO, ruled out 0b111 - Allow the VM to select anything, including cacheable. This is nice, but summarizing Catalin's remarks: a) The kernel can't do cache maintenance (defeats FWB) b) Catalin's concerns about MTE and Atomics triggering uncontained failures c) It is unclear about uncontained failures for cacheable in general So the only argument is 001 v 101 v 111. Catalin has explained why 111 is not good as a default. Most likely with some future ACPI/BSA/etc work, and some cache syncing in the kernel, someone could define a way to allow 111 as a choice. So, I think we can rule out 111 as being the default choice without the kernel getting more detailed system-level knowledge. Further, patch 1 is about making 110 a driver-opt-in choice for VFIO memory, which reduces the need for 111. For 001 v 101: 001 is less functional in the VM. 001 offers no advantage. !FEAT_S2FWB has similar choices and a similar argument. So, IMHO, if someone comes to ask for something it would be to ask for 111, and we do have a set of pretty clear reasons why it should not be 111 (indeed we wanted to ask for that instead of patch 1, but there are too many things required to get there). Jason
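As a quick reference for the encodings listed above, a minimal C sketch; the macro and helper names are invented for this example and are not kernel definitions:

    /* FEAT_S2FWB stage-2 MemAttr options summarised in the mail above
     * (Arm ARM DDI 0487, D5.5.5 / D8.5.5). Names are illustrative only.
     */
    #define S2FWB_MEMATTR_DEVICE_nGnRE   0b001  /* today: VM forced to Device nGnRE            */
    #define S2FWB_MEMATTR_NORMAL_NC      0b101  /* proposed: guest picks Device-* or Normal-NC */
    #define S2FWB_MEMATTR_FORCE_WB       0b110  /* forces write-back, breaks MMIO              */
    #define S2FWB_MEMATTR_GUEST_CHOICE   0b111  /* guest may select anything, incl. cacheable  */

    /* Hypothetical helper showing where such a choice would sit when
     * building a stage-2 device mapping (not actual KVM code):
     */
    static inline unsigned int stage2_device_memattr(bool allow_normal_nc)
    {
    	return allow_normal_nc ? S2FWB_MEMATTR_NORMAL_NC : S2FWB_MEMATTR_DEVICE_nGnRE;
    }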
On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > Claiming back the device also seems strange if the guest has been using > > non-cacheable accesses since I think you could get write merging and > > reordering with subsequent device accesses trying to reset the device. > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). We do have a good story for this part: use Device-nGnRE! > > > So, for now I'd only relax this if we know there's RAM(-like) on the > > > other side and won't trigger some potentially uncontainable errors as a > > > result. > > > > I guess my wider point is that I'm not convinced that non-cacheable is > > actually much better and I think we're going way off the deep end looking > > at what particular implementations do and trying to justify to ourselves > > that non-cacheable is safe, even though it's still a normal memory type > > at the end of the day. > > Is this about Device vs NC or Device/NC vs Normal Cacheable? The > justification for the former has been summarised in Lorenzo's write-up. > How the hardware behaves, it depends a lot on the RAS implementation. > The BSA has some statements but not sure it covers everything. I don't know how to be more clear, but I'm asking why Normal-NC is the right memory type to use. Jason's explanation on the other thread about how it's basically the only option with FWB (with some hand-waving about why Normal-cacheable isn't safe) will have to do, but it needs to be in the commit message. The host maps MMIO with Device-nGnRE. Allowing a guest to map it as Normal surely means the host is going to need additional logic to deal with that? We briefly discussed claiming back a device above and I'm not convinced that the code we have for doing that today will work correctly if the guest has issued a bunch of Normal-NC stores prior to the device being unmapped. Could we change these patches so that the memory type of the stage-1 VMA in the VMM is reflected in the stage-2? In other words, continue to use Device mappings at stage-2 for I/O but relax to Normal-NC if that's how the VMM has it mapped? > Things can go wrong but that's not because Device does anything better. > Given the RAS implementation, external aborts caused on Device memory > (e.g. wrong size access) is uncontainable. For Normal NC it can be > contained (I can dig out the reasoning behind this if you want, IIUC > something to do with not being able to cancel an already issued Device > access since such accesses don't allow speculation due to side-effects; > for Normal NC, it's just about the software not getting the data). I really think these details belong in the commit message. > > Obviously, it's up to Marc and Oliver if they want to do this, but I'm > > wary without an official statement from Arm to say that Normal-NC is > > correct. There's mention of such a statement in the cover letter: > > > > > We hope ARM will publish information helping platform designers > > > follow these guidelines. > > > > but imo we shouldn't merge this without either: > > > > (a) _Architectural_ guidance (as opposed to some random whitepaper or > > half-baked certification scheme). > > Well, you know the story, the architects will probably make it a SoC or > integration issue, PCIe etc., not something that can live in the Arm > ARM. The best we could get is more recommendations in the RAS spec > around containment but not for things that might happen outside the CPU, > e.g. 
PCIe root complex. The Arm ARM _does_ mention PCI config space when talking about early write acknowledgement, so there's some precedence for providing guidance around which memory types to use. > > - or - > > > > (b) A concrete justification based on the current architecture as to > > why Normal-NC is the right thing to do for KVM. > > To put it differently, we don't have any strong arguments why Device is > the right thing to do. We chose Device based on some understanding > software people had about how the hardware behaves, which apparently > wasn't entirely correct (and summarised by Lorenzo). I think we use Device because that's what the host uses in its stage-1 and mismatched aliases are bad. Will
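A hypothetical sketch of the suggestion above (reflecting the VMM's stage-1 VMA attributes at stage 2) could look like the following; the helper name is invented for illustration, while PTE_ATTRINDX()/MT_NORMAL_NC are the existing arm64 definitions:

    /* Sketch only: check whether the VMM mapped this MMIO range Normal-NC
     * at stage 1, so KVM could mirror that at stage 2 instead of forcing
     * Device. Not actual KVM code.
     */
    static bool vmm_mapped_normal_nc(struct vm_area_struct *vma)
    {
    	pteval_t attr = pgprot_val(vma->vm_page_prot) & PTE_ATTRINDX_MASK;

    	return attr == PTE_ATTRINDX(MT_NORMAL_NC);
    }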
On Thu, Oct 12, 2023 at 03:36:24PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote: > > > All I'm asking for is justification as to why Normal-NC is the right > > memory type rather than any other normal memory type. If it's not possible > > to explain that architecturally, then I'm not sure this change belongs in > > architecture code. > > Well, I think Catalin summarized it nicely, I second his ask at the end: > > We are basically at your scenario below - can you justify why > DEVICE_GRE is correct today, architecturally? We could not. Earlier > someone said uncontained failure prevention, but a deep dive on that > found it is not so. This logic can be used to justify the use of any other memory type. I'm asking specifically why Normal-NC is correct. > > Ultimately, we need to be able to maintain this stuff, so we can't just > > blindly implement changes based on a combination of off-list discussions > > and individual product needs. For example, if somebody else rocks up > > tomorrow and asks for this to be Normal-writethrough, what grounds do > > we have to say no if we've taken this change already? > > Hmm. Something got lost here. Well, I didn't assume FWB since these patches change the behaviour regardless... > This patch is talking about the S2 MemAttr[3:0]. There are only 4 > relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c) > > 0b001 - Today: force VM to be Device nGnRE > > 0b101 - Proposed: prevent the VM from selecting cachable, allow it to > choose Device-* or NormalNC > > 0b110 - Force write back. Would totally break MMIO, ruled out > > 0b111 - Allow the VM to select anything, including cachable. > This is nice, but summarizing Catalin's remarks: > a) The kernel can't do cache maintenance (defeats FWB) > b) Catalin's concerns about MTE and Atomics triggering > uncontained failures > c) It is unclear about uncontained failures for cachable > in general > > So the only argument is 001 v 110 v 111 Ok, so I can see why you end up with Normal-NC on a system that implements FWB. Systems without FWB have other options, but you can reasonably argue that having consistent behaviour between the two configurations is desirable. Please can you add that to the commit message? I still don't understand how to reclaim an MMIO region that the guest has mapped Normal-NC (see the thread with Catalin). Will
On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote: > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > Claiming back the device also seems strange if the guest has been using > > > non-cacheable accesses since I think you could get write merging and > > > reordering with subsequent device accesses trying to reset the device. > > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > We do have a good story for this part: use Device-nGnRE! Don't we actually need Device-nGnRnE for this, coupled with a DSB for endpoint completion? Device-nGnRE may be sufficient as a read from that device would ensure that the previous write is observable (potentially with a DMB if accessing separate device regions) but I don't think we do this now either. Even this, isn't it device-specific? I don't know enough about PCIe, posted writes, reordering, maybe others can shed some light. For Normal NC, if the access doesn't have side-effects (or rather the endpoint is memory-like), I think we are fine. The Stage 2 unmapping + TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU was pushed sufficiently far as not to affect subsequent writes by other CPUs. For I/O accesses that change some state of the device, I'm not sure the TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only nE + DSB as long as the PCIe device plays along nicely. > Could we change these patches so that the memory type of the stage-1 VMA > in the VMM is reflected in the stage-2? In other words, continue to use > Device mappings at stage-2 for I/O but relax to Normal-NC if that's > how the VMM has it mapped? We've been through this and it's not feasible. The VMM does not have detailed knowledge of the BARs of the PCIe device it is mapping (and the prefetchable BAR attribute is useless). It may end up with a Normal mapping of a BAR with read side-effects. It's only the guest driver that knows all the details. The safest is for the VMM to keep it as Device (I think vfio-pci goes for the strongest nGnRnE). Yes, we end up with mismatched aliases but they only matter if the VMM also accesses the I/O range via its own mapping. So far I haven't seen case that suggests this. > > Things can go wrong but that's not because Device does anything better. > > Given the RAS implementation, external aborts caused on Device memory > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > contained (I can dig out the reasoning behind this if you want, IIUC > > something to do with not being able to cancel an already issued Device > > access since such accesses don't allow speculation due to side-effects; > > for Normal NC, it's just about the software not getting the data). > > I really think these details belong in the commit message. I guess another task for Lorenzo ;). > > > Obviously, it's up to Marc and Oliver if they want to do this, but I'm > > > wary without an official statement from Arm to say that Normal-NC is > > > correct. There's mention of such a statement in the cover letter: > > > > > > > We hope ARM will publish information helping platform designers > > > > follow these guidelines. > > > > > > but imo we shouldn't merge this without either: > > > > > > (a) _Architectural_ guidance (as opposed to some random whitepaper or > > > half-baked certification scheme). 
> > > > Well, you know the story, the architects will probably make it a SoC or > > integration issue, PCIe etc., not something that can live in the Arm > > ARM. The best we could get is more recommendations in the RAS spec > > around containment but not for things that might happen outside the CPU, > > e.g. PCIe root complex. > > The Arm ARM _does_ mention PCI config space when talking about early write > acknowledgement, so there's some precedence for providing guidance around > which memory types to use. Ah, yes, it looks like it does, though mostly around the config space. We could ask them to add some notes but I don't think we have the problem well defined yet. Trying to restate what we aim: the guest driver knows what attributes it needs and would set the appropriate attributes: Device or Normal. KVM's role is not to fix bugs in the guest driver by constraining the attributes but rather to avoid potential security issues with malicious (or buggy) guests: 1) triggering uncontained errors 2) accessing memory that it shouldn't (like the MTE tag access) 3) causing delayed side-effects after the host reclaims the device ... anything else? For (1), Normal NC vs. Device doesn't make any difference, slightly better for the former. (2) so far is solved by not allowing Cacheable (or disabling MTE, enabling FEAT_MTE_PERM in the future). I'm now trying to understand (3), I think it needs more digging. > > > (b) A concrete justification based on the current architecture as to > > > why Normal-NC is the right thing to do for KVM. > > > > To put it differently, we don't have any strong arguments why Device is > > the right thing to do. We chose Device based on some understanding > > software people had about how the hardware behaves, which apparently > > wasn't entirely correct (and summarised by Lorenzo). > > I think we use Device because that's what the host uses in its stage-1 > and mismatched aliases are bad. They are "constrained" bad ;).
On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: > On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote: > > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > > Claiming back the device also seems strange if the guest has been using > > > > non-cacheable accesses since I think you could get write merging and > > > > reordering with subsequent device accesses trying to reset the device. > > > > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > > > We do have a good story for this part: use Device-nGnRE! > > Don't we actually need Device-nGnRnE for this, coupled with a DSB for > endpoint completion? > > Device-nGnRE may be sufficient as a read from that device would ensure > that the previous write is observable (potentially with a DMB if > accessing separate device regions) but I don't think we do this now > either. Even this, isn't it device-specific? I don't know enough about > PCIe, posted writes, reordering, maybe others can shed some light. > > For Normal NC, if the access doesn't have side-effects (or rather the > endpoint is memory-like), I think we are fine. The Stage 2 unmapping + > TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU > was pushed sufficiently far as not to affect subsequent writes by other > CPUs. > > For I/O accesses that change some state of the device, I'm not sure the > TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only > nE + DSB as long as the PCIe device plays along nicely. Can someone explain this concern a little more simply please? Let's try something simpler. I have no KVM. My kernel driver creates a VMA with pgprot_writecombine (NormalNC). Userspace does a write to the NormalNC mapping and immediately unmaps the VMA. What is the issue? And then how does making KVM the thing that creates the NormalNC mapping change this? Not knowing the whole details, here is my story about how it should work: Unmapping the VMAs must already have some NormalNC-friendly ordering barrier across all CPUs or we have a bigger problem. This barrier definitely must close write combining. VFIO issues a config space write to reset the PCI function. Config space writes MUST NOT write combine with anything. This is already impossible for PCIe since they are different TLP types at the PCIe level. By the PCIe rules, a config space write must order strictly after all other CPUs' accesses. Once the reset non-posted write returns back to VFIO we know that: 1) There is no reference in any CPU page table to the MMIO PFN 2) No CPU has pending data in any write buffer 3) The interconnect and PCIe fabric have no inflight operations 4) The device is in a clean post-reset state ? > knows all the details. The safest is for the VMM to keep it as Device (I > think vfio-pci goes for the strongest nGnRnE). We are probably going to allow VFIO to let userspace pick if it should be pgprot_device or pgprot_writecombine. The alias issue could be resolved by teaching KVM how to insert a physical PFN based on some VFIO FD/dmabuf rather than a VMA so that the PFNs are never mapped on the hypervisor side. Jason
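To make the "no KVM" scenario above concrete, a minimal sketch of such a driver mmap path; struct my_dev and its fields are invented, while pgprot_writecombine() and io_remap_pfn_range() are the standard kernel interfaces:

    /* Sketch: expose an MMIO BAR to userspace as Normal-NC (write-combine). */
    static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
    {
    	struct my_dev *mdev = file->private_data;
    	unsigned long len = vma->vm_end - vma->vm_start;

    	if (len > mdev->bar_len)
    		return -EINVAL;

    	/* Normal-NC at stage 1: allows gathering/write combining */
    	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

    	return io_remap_pfn_range(vma, vma->vm_start,
    				  mdev->bar_phys >> PAGE_SHIFT,
    				  len, vma->vm_page_prot);
    }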
On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: [...] > Yes, we end up with mismatched aliases but they only matter if the VMM > also accesses the I/O range via its own mapping. So far I haven't seen > case that suggests this. > > > > Things can go wrong but that's not because Device does anything better. > > > Given the RAS implementation, external aborts caused on Device memory > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > > contained (I can dig out the reasoning behind this if you want, IIUC > > > something to do with not being able to cancel an already issued Device > > > access since such accesses don't allow speculation due to side-effects; > > > for Normal NC, it's just about the software not getting the data). > > > > I really think these details belong in the commit message. > > I guess another task for Lorenzo ;). I will do. I am starting to wonder though whether this documentation belongs in this commit log only or at Documentation/arch/arm64 level (or both). I am pretty sure this thread can turn out quite useful as a reference (it is for me); if we manage to summarize it, that would benefit everyone. Lorenzo
On Fri, Oct 13, 2023 at 10:45:41AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: > > On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote: > > > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > > > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > > > Claiming back the device also seems strange if the guest has been using > > > > > non-cacheable accesses since I think you could get write merging and > > > > > reordering with subsequent device accesses trying to reset the device. > > > > > > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > > > > > We do have a good story for this part: use Device-nGnRE! > > > > Don't we actually need Device-nGnRnE for this, coupled with a DSB for > > endpoint completion? > > > > Device-nGnRE may be sufficient as a read from that device would ensure > > that the previous write is observable (potentially with a DMB if > > accessing separate device regions) but I don't think we do this now > > either. Even this, isn't it device-specific? I don't know enough about > > PCIe, posted writes, reordering, maybe others can shed some light. > > > > For Normal NC, if the access doesn't have side-effects (or rather the > > endpoint is memory-like), I think we are fine. The Stage 2 unmapping + > > TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU > > was pushed sufficiently far as not to affect subsequent writes by other > > CPUs. > > > > For I/O accesses that change some state of the device, I'm not sure the > > TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only > > nE + DSB as long as the PCIe device plays along nicely. > > Can someone explain this concern a little more simply please? > > Let's try something simpler. I have no KVM. My kernel driver > creates a VMA with pgprot_writecombine (NormalNC). Userpsace does a > write to the NormalNC and immediately unmaps the VMA > > What is the issue? The issue is when the device is reclaimed to be given to another user-space process, do we know the previous transaction reached the device? With the TLBI + DSB in unmap, we can only tell that a subsequent map + write (in a new process) is ordered after the write in the old process. Maybe that's sufficient in most cases. > And then how does making KVM the thing that creates the NormalNC > change this? They are similar. > Not knowing the whole details, here is my story about how it should work: > > Unmapping the VMA's must already have some NormalNC friendly ordering > barrier across all CPUs or we have a bigger problem. This barrier > definately must close write combining. I think what we have is TLBI + DSB generating the DVM/DVMSync messages should ensure that the Normal NC writes on other CPUs reach some serialisation point (not necessarily the device, AFAICT we can't guarantee end-point completion here). > VFIO issues a config space write to reset the PCI function. Config > space writes MUST NOT write combine with anything. This is already > impossible for PCIe since they are different TLP types at the PCIe > level. Yes, config space writes are fine, vfio-pci even maps them as Device_nGnRnE. But AFAIK a guest won't have direct access to the config space. > By the PCIe rules, config space write must order strictly after all > other CPU's accesses. 
Once the reset non-posted write returns back to > VFIO we know that: > > 1) There is no reference in any CPU page table to the MMIO PFN > 2) No CPU has pending data in any write buffer > 3) The interconnect and PCIe fabric have no inflight operations > 4) The device is in a clean post-reset state I think from the CPU perspective, we can guarantee that a Normal_NC write on CPU0 for example reaches a serialisation point before a config space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If it is the PCIe root complex, we are probably fine, we hope it deals with any ordering between the Normal_NC write and the config space one. Talking to Will earlier, I think we can deem the PCIe scenario (somewhat) safe but not as a generic mechanism for other non-PCIe devices (e.g. platform). With this concern, can we make this Stage 2 relaxation in KVM only for vfio-pci mappings? I don't have an example of non-PCIe device assignment to figure out how this should work though. > > knows all the details. The safest is for the VMM to keep it as Device (I > > think vfio-pci goes for the strongest nGnRnE). > > We are probably going to allow VFIO to let userspace pick if it should > be pgprot_device or pgprot_writecombine. I guess that's for the direct use by an application rather than VMM+VM. IIUC people work around this currently by mapping PCIe BARs as pgprot_writecombine() via sysfs. Getting vfio-pci to allow different mappings is probably a good idea, though it doesn't currently help with the KVM case as we can't force the VMM to know the specifics of the device it is giving to a guest. > The alias issue could be resolved by teaching KVM how to insert a > physical PFN based on some VFIO FD/dmabuf rather than a VMA so that > the PFNs are never mapped in the hypervisor side. This should work as well and solves the aliasing problem, though it requires changes to the VMM as well, not just KVM, which currently relies on vfio-pci mmap().
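For readers following the ordering argument, a rough sketch of the unmap-side sequence being relied on; this is not the actual KVM code (which uses level hints and VHE/nVHE variants), but __tlbi(), dsb() and isb() are the arm64 kernel primitives:

    /* Sketch: tear down a stage-2 MMIO mapping and broadcast-invalidate it.
     * After the DSBs complete, no CPU holds a live translation for the PFN
     * and pending Normal-NC writes from other CPUs have reached a
     * serialisation point (not necessarily the end-point device).
     */
    WRITE_ONCE(*s2_ptep, 0);          /* clear the stage-2 entry             */
    dsb(ishst);                       /* make the PTE update visible         */
    __tlbi(ipas2e1is, ipa >> 12);     /* invalidate stage-2 entry, broadcast */
    dsb(ish);                         /* wait for completion (DVM sync)      */
    __tlbi(vmalle1is);                /* drop combined stage-1+2 entries     */
    dsb(ish);
    isb();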
On Fri, Oct 13, 2023 at 05:28:10PM +0200, Lorenzo Pieralisi wrote: > On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: > > [...] > > > Yes, we end up with mismatched aliases but they only matter if the VMM > > also accesses the I/O range via its own mapping. So far I haven't seen > > case that suggests this. > > > > > > Things can go wrong but that's not because Device does anything better. > > > > Given the RAS implementation, external aborts caused on Device memory > > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > > > contained (I can dig out the reasoning behind this if you want, IIUC > > > > something to do with not being able to cancel an already issued Device > > > > access since such accesses don't allow speculation due to side-effects; > > > > for Normal NC, it's just about the software not getting the data). > > > > > > I really think these details belong in the commit message. > > > > I guess another task for Lorenzo ;). > > I will do, I start wondering though whether this documentation belongs > in this commit log only or at Documentation/arch/arm64 level (or both), > I am pretty sure this thread can turn out quite useful as a reference > (it is for me) if we manage to summarize it that would benefit > everyone. I think it makes sense to add something in the Documentation for easier future reference. We can also keep adding to it as we learn more. The commit log can be shorter in this case.
On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > The issue is when the device is reclaimed to be given to another > user-space process, do we know the previous transaction reached the > device? With the TLBI + DSB in unmap, we can only tell that a subsequent > map + write (in a new process) is ordered after the write in the old > process. Maybe that's sufficient in most cases. I can't think of a case where ordering is insufficient. As long as the device perceives a strict separation in order between writes preceding the 'reset' writel() and writes following the reset writel(), the absolute time of completion cannot matter. At least for VFIO PCI the reset sequence uses non-posted config TLPs followed by config read TLPs. For something like mlx5, when it reclaims a WC VMA it does an RPC which is dma_wmb(), a writel(), a device DMA read, a device DMA write and a CPU read of DMA'd data. Both of these are pretty "firm" ordering barriers. > > Unmapping the VMAs must already have some NormalNC-friendly ordering > > barrier across all CPUs or we have a bigger problem. This barrier > > definitely must close write combining. > > I think what we have is TLBI + DSB generating the DVM/DVMSync messages > should ensure that the Normal NC writes on other CPUs reach some > serialisation point (not necessarily the device, AFAICT we can't > guarantee end-point completion here). This is what our internal architects thought as well. > Talking to Will earlier, I think we can deem the PCIe scenario > (somewhat) safe but not as a generic mechanism for other non-PCIe > devices (e.g. platform). With this concern, can we make this Stage 2 > relaxation in KVM only for vfio-pci mappings? I don't have an example of > non-PCIe device assignment to figure out how this should work though. It is not a KVM problem. As I implied above it is VFIO's responsibility to reliably reset the device, not KVM's. If for some reason VFIO cannot do that on some SOC then VFIO devices should not exist. It is not KVM's job to second-guess VFIO's own security properties. Specifically about platform devices: the generic VFIO platform driver is the ACPI-based one. If the FW provides an ACPI method for device reset that is not properly serializing, that is a FW bug. We can quirk it in VFIO and block using those devices if they actually exist. I expect the non-generic VFIO platform drivers to take care of this issue internally with barriers, reads from devices, whatever is required to make their SOCs order properly. Just as I would expect a normal Linux platform driver to directly manage whatever implementation-specific ordering quirks the SOC may have. Generic PCI drivers are the ones that rely on the implicit ordering at the PCIe TLP level from TLBI + DSB. I would say this is part of the undocumented arch API around pgprot_wc. > > > knows all the details. The safest is for the VMM to keep it as Device (I > > > think vfio-pci goes for the strongest nGnRnE). > > > > We are probably going to allow VFIO to let userspace pick if it should > > be pgprot_device or pgprot_writecombine. > > I guess that's for the direct use by an application rather than > VMM+VM. Yes, there have been requests for this. > IIUC people work around this currently by mapping PCIe BARs as > pgprot_writecombine() via sysfs. Getting vfio-pci to allow different > mappings is probably a good idea, though it doesn't currently help with > the KVM case as we can't force the VMM to know the specifics of the > device it is giving to a guest. Yes to all points. Thanks, Jason
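As a concrete example of the "strict separation around the reset write" idea, a hedged sketch of the usual MMIO flush pattern; the register names are invented, and this shows the generic MMIO-register variant, not the PCI config-space reset path VFIO actually uses:

    /* Sketch: make sure earlier posted MMIO writes cannot merge with or pass
     * the reset, by reading back from the device (non-posted) afterwards.
     */
    writel(MYDEV_RESET, regs + MYDEV_CTRL);  /* posted write triggering reset */
    (void)readl(regs + MYDEV_CTRL);          /* read-back flushes/orders it   */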
On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: [...] > > VFIO issues a config space write to reset the PCI function. Config > > space writes MUST NOT write combine with anything. This is already > > impossible for PCIe since they are different TLP types at the PCIe > > level. > > Yes, config space writes are fine, vfio-pci even maps them as > Device_nGnRnE. But AFAIK a guest won't have direct access to the config > space. > > > By the PCIe rules, config space write must order strictly after all > > other CPU's accesses. Once the reset non-posted write returns back to > > VFIO we know that: > > > > 1) There is no reference in any CPU page table to the MMIO PFN > > 2) No CPU has pending data in any write buffer > > 3) The interconnect and PCIe fabric have no inflight operations > > 4) The device is in a clean post-reset state > > I think from the CPU perspective, we can guarantee that a Normal_NC > write on CPU0 for example reaches a serialisation point before a config > space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a > TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If > it is the PCIe root complex, we are probably fine, we hope it deals with > any ordering between the Normal_NC write and the config space one. If it is the PCI host bridge (and for PCI it should be since it is the logic between the ARM world - where ARM ordering rules and barriers apply - and PCI protocol), either it enforces PCI ordering rules or it is broken by design; if it is the latter, at that stage device assignment would be the least of our problems. For non-PCI device assignment, I am not sure at all we can rely on anything other than what Jason mentioned, eg resets (and the components that through eg MMIO are carrying them out) are not architected, the device MMIO space and the MMIO space used to trigger the reset (again, it is an example) may well be placed on different interconnect paths, it is device specific. Lorenzo > Talking to Will earlier, I think we can deem the PCIe scenario > (somewhat) safe but not as a generic mechanism for other non-PCIe > devices (e.g. platform). With this concern, can we make this Stage 2 > relaxation in KVM only for vfio-pci mappings? I don't have an example of > non-PCIe device assignment to figure out how this should work though. > > > > knows all the details. The safest is for the VMM to keep it as Device (I > > > think vfio-pci goes for the strongest nGnRnE). > > > > We are probably going to allow VFIO to let userspace pick if it should > > be pgprot_device or pgprot_writecombine. > > I guess that's for the direct use by an application rather than VMM+VM. > IIUC people work around this currently by mapping PCIe BARs as > pgprot_writecombine() via sysfs. Getting vfio-pci to allow different > mappings is probably a good idea, though it doesn't currently help with > the KVM case as we can't force the VMM to know the specifics of the > device it is giving to a guest. > > > The alias issue could be resolved by teaching KVM how to insert a > > physical PFN based on some VFIO FD/dmabuf rather than a VMA so that > > the PFNs are never mapped in the hypervisor side. > > This should work as well and solves the aliasing problem, though it > requires changes to the VMM as well, not just KVM, which currently > relies on vfio-pci mmap(). > > -- > Catalin
On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > > Talking to Will earlier, I think we can deem the PCIe scenario > > (somewhat) safe but not as a generic mechanism for other non-PCIe > > devices (e.g. platform). With this concern, can we make this Stage 2 > > relaxation in KVM only for vfio-pci mappings? I don't have an example of > > non-PCIe device assignment to figure out how this should work though. > > It is not a KVM problem. As I implied above it is VFIO's > responsibility to reliably reset the device, not KVMs. If for some > reason VFIO cannot do that on some SOC then VFIO devices should not > exist. > > It is not KVM's job to double guess VFIO's own security properties. I'd argue that since KVM is the one relaxing the memory attributes beyond what the VFIO driver allows the VMM to use, it is KVM's job to consider the security implications. This is fine for vfio-pci and Normal_NC but I'm not sure we can generalise. > Specifically about platform the generic VFIO platform driver is the > ACPI based one. If the FW provides an ACPI method for device reset > that is not properly serializing, that is a FW bug. We can quirk it in > VFIO and block using those devices if they actually exist. > > I expect the non-generic VFIO platform drivers to take care of this > issue internally with, barriers, read from devices, whatver is > required to make their SOCs order properly. Just as I would expect a > normal Linux platform driver to directly manage whatever > implementation specific ordering quirks the SOC may have. This would be a new requirement if an existing VFIO platform driver relied on all mappings being Device. But maybe that's just theoretical at the moment, are there any concrete examples outside vfio-pci? If not, we can document it as per Lorenzo's suggestion to summarise this discussion under Documentation/.
On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote: > On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote: > > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > > > Talking to Will earlier, I think we can deem the PCIe scenario > > > (somewhat) safe but not as a generic mechanism for other non-PCIe > > > devices (e.g. platform). With this concern, can we make this Stage 2 > > > relaxation in KVM only for vfio-pci mappings? I don't have an example of > > > non-PCIe device assignment to figure out how this should work though. > > > > It is not a KVM problem. As I implied above it is VFIO's > > responsibility to reliably reset the device, not KVMs. If for some > > reason VFIO cannot do that on some SOC then VFIO devices should not > > exist. > > > > It is not KVM's job to double guess VFIO's own security properties. > > I'd argue that since KVM is the one relaxing the memory attributes > beyond what the VFIO driver allows the VMM to use, it is KVM's job to > consider the security implications. This is fine for vfio-pci and > Normal_NC but I'm not sure we can generalise. I can see that, but I belive we should take this responsibility into VFIO as a requirement. As I said in the other email we do want to extend VFIO to support NormalNC VMAs for DPDK, so we need to take this anyhow. > > Specifically about platform the generic VFIO platform driver is the > > ACPI based one. If the FW provides an ACPI method for device reset > > that is not properly serializing, that is a FW bug. We can quirk it in > > VFIO and block using those devices if they actually exist. > > > > I expect the non-generic VFIO platform drivers to take care of this > > issue internally with, barriers, read from devices, whatver is > > required to make their SOCs order properly. Just as I would expect a > > normal Linux platform driver to directly manage whatever > > implementation specific ordering quirks the SOC may have. > > This would be a new requirement if an existing VFIO platform driver > relied on all mappings being Device. But maybe that's just theoretical > at the moment, are there any concrete examples outside vfio-pci? If not, > we can document it as per Lorenzo's suggestion to summarise this > discussion under Documentation/. My point is if this becomes a real world concern we have a solid answer on how to resolve it - fix the VFIO driver to have a stronger barrier before reset. I'm confident it is not a problem for PCI and IIRC the remaining ARM platform drivers were made primarily for DPDK, not KVM. So I agree with documenting and perhaps a comment someplace in VFIO is also warranted. Jason
On Fri, Oct 20, 2023 at 08:47:19AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote: > > On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote: > > > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > > > > Talking to Will earlier, I think we can deem the PCIe scenario > > > > (somewhat) safe but not as a generic mechanism for other non-PCIe > > > > devices (e.g. platform). With this concern, can we make this Stage 2 > > > > relaxation in KVM only for vfio-pci mappings? I don't have an example of > > > > non-PCIe device assignment to figure out how this should work though. > > > > > > It is not a KVM problem. As I implied above it is VFIO's > > > responsibility to reliably reset the device, not KVMs. If for some > > > reason VFIO cannot do that on some SOC then VFIO devices should not > > > exist. > > > > > > It is not KVM's job to double guess VFIO's own security properties. > > > > I'd argue that since KVM is the one relaxing the memory attributes > > beyond what the VFIO driver allows the VMM to use, it is KVM's job to > > consider the security implications. This is fine for vfio-pci and > > Normal_NC but I'm not sure we can generalise. > > I can see that, but I belive we should take this responsibility into > VFIO as a requirement. As I said in the other email we do want to > extend VFIO to support NormalNC VMAs for DPDK, so we need to take this > anyhow. > > > > Specifically about platform the generic VFIO platform driver is the > > > ACPI based one. If the FW provides an ACPI method for device reset > > > that is not properly serializing, that is a FW bug. We can quirk it in > > > VFIO and block using those devices if they actually exist. > > > > > > I expect the non-generic VFIO platform drivers to take care of this > > > issue internally with, barriers, read from devices, whatver is > > > required to make their SOCs order properly. Just as I would expect a > > > normal Linux platform driver to directly manage whatever > > > implementation specific ordering quirks the SOC may have. > > > > This would be a new requirement if an existing VFIO platform driver > > relied on all mappings being Device. But maybe that's just theoretical > > at the moment, are there any concrete examples outside vfio-pci? If not, > > we can document it as per Lorenzo's suggestion to summarise this > > discussion under Documentation/. > > My point is if this becomes a real world concern we have a solid > answer on how to resolve it - fix the VFIO driver to have a stronger > barrier before reset. Just to make sure I am parsing this correctly: this case above is related to a non-PCI VFIO device passthrough where a guest would want to map the device MMIO at stage-1 with normal-NC memory type (well, let's say with a memory attribute != device-nGnRE - that combined with the new stage-2 default might cause transactions ordering/grouping trouble with eg device resets), correct ? IIRC, all requests related to honouring "write-combine" style stage-1 mappings were for PCI(e) devices but that's as far as what *I* was made aware of goes. > I'm confident it is not a problem for PCI and IIRC the remaining ARM > platform drivers were made primarily for DPDK, not KVM. > > So I agree with documenting and perhaps a comment someplace in VFIO is > also warranted. We will do that, I will start adding the recent discussions to the new documentation file. 
Side note: for those who attend LPC it would be useful to review the resulting documentation together there, it should happen around v6.7-rc1. Lorenzo
On Fri, Oct 20, 2023 at 04:03:57PM +0200, Lorenzo Pieralisi wrote: > > My point is if this becomes a real world concern we have a solid > > answer on how to resolve it - fix the VFIO driver to have a stronger > > barrier before reset. > > Just to make sure I am parsing this correctly: this case above is > related to a non-PCI VFIO device passthrough where a guest would want to > map the device MMIO at stage-1 with normal-NC memory type (well, let's > say with a memory attribute != device-nGnRE - that combined with the new > stage-2 default might cause transactions ordering/grouping trouble with > eg device resets), correct ? This is what I have understood was Will's concern, yes. > IIRC, all requests related to honouring "write-combine" style > stage-1 mappings were for PCI(e) devices but that's as far as what > *I* was made aware of goes. Yes, this is what I am aware of as well. Though I do not object to the idea from the VFIO side that platform devices would also have to support NormalNC too. The theoretical missing peice is that someone would say they have a SOC issue XYZ and thus their VFIO platform devices must fully block NormalNC. I suggest if someone comes with this and really, really wants VFIO, then we could use a VMA flag to indicate that KVM must not upgrade it. Currently I have no knowledge of such a thing existing. With PCI we've made the argument that if NormalNC is broken unsafe for KVM in the SOC then probably so is Device-*. I think the same basic argument holds for platform devices too. Thus I'm skeptical that someone can come and say they have SOC issue XYZ and NormalNC is broken but Device-* is perfectly safe. > We will do that, I will start adding the recent discussions to the > new documentation file. Side note: for those who attend LPC it would be > useful to review the resulting documentation together there, it should > happen around v6.7-rc1. I will be there, let me know Thanks, Jason
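A hypothetical sketch of the VMA-flag idea mentioned above; the flag name is invented and no such flag exists today:

    /* Sketch: KVM would skip the Normal-NC relaxation when the VFIO driver
     * marked the VMA as requiring strict Device semantics.
     */
    static bool kvm_can_relax_to_normal_nc(struct vm_area_struct *vma)
    {
    	return !(vma->vm_flags & VM_NO_NC_UPGRADE);  /* hypothetical flag */
    }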
On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: [...] > > > Things can go wrong but that's not because Device does anything better. > > > Given the RAS implementation, external aborts caused on Device memory > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > > contained (I can dig out the reasoning behind this if you want, IIUC > > > something to do with not being able to cancel an already issued Device > > > access since such accesses don't allow speculation due to side-effects; > > > for Normal NC, it's just about the software not getting the data). > > > > I really think these details belong in the commit message. > > I guess another task for Lorenzo ;). Updated commit log (it might be [is] too verbose) below, it should probably be moved into a documentation file but to do that I should decouple it from this changeset (ie a document explaining memory attributes and error containment for ARM64 - indipendent from KVM S2 defaults). I'd also add a Link: to the lore archive entry for reference (did not add it in the log below). Please let me know what's the best option here. Thanks, Lorenzo -- >8 -- Currently, KVM for ARM64 maps at stage 2 memory that is considered device (ie it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system can't determine device MMIO mapping memory attributes on its own but it is always overriden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes on a page by page basis independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (eg Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (eg gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (eg unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (ie it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be imprecise. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings. Actually, DEVICE_nGnRE memory type is even more problematic than eg Normal-NC memory type in terms of faults containability in that eg aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, synchronous (ie that would imply that the processor should issue at most 1 load transaction at a time - ie it can't pipeline them - otherwise the synchronous abort semantics would break the no-speculation attribute attached to DEVICE_XXX memory). This means that regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures and that in turn relies on platform capabilities and the device type being assigned (ie PCIe AER/DPC error containment and RAS architecture[3]); therefore the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform (if the platform design adheres to design guidelines outlined in [3]) and therefore can be relaxed. 
For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to Normal-NC. A different Normal memory type default at stage-2 (eg Normal Write-through) could have been chosen instead of Normal-NC but Normal-NC was chosen because: - Its attributes are well-understood compared to other Normal memory types for MMIO mappings - On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible default for normal memory types. For systems implementing FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options to choose the memory types are as follows: if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX (XXX determined by S2 MemAttr[1:0]). This is what we have today (MemAttr[2:0] == 0b001) and therefore it is not described any further. if S2 MemAttr[2] == 1, there are the following options: S2 MemAttr[2:0] | Resulting mapping ----------------------------------------------------------------------------- 0b101 | Prevent the guest from selecting cachable memory type, it | allows it to choose Device-* or Normal-NC 0b110 | It forces write-back memory type; it breaks MMIO. 0b111 | Allow the VM to select any memory type including cachable. | It is unclear whether this is safe from a platform | perspective, especially wrt uncontained failures and | cacheability (eg device reclaim/reset and cache | maintenance). ------------------------------------------------------------------------------ - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping identical to FEAT_S2FWB (that basically would force Normal-NC, see option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach between the two Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (ie device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered wrt MMIO transactions according to the PCI ordering rules. Having Normal-NC S2 default puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (ie it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. 
[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
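As a compact restatement of the combination table in the log above, an illustrative helper (not kernel code, enum names invented):

    /* With stage 2 set to Normal-NC: a Device attribute at stage 1 still
     * wins, any Normal stage-1 type resolves to Normal-NC.
     */
    enum s1_type { S1_DEVICE, S1_NORMAL_NC, S1_NORMAL_WT, S1_NORMAL_WB };
    enum s2_result { RES_DEVICE, RES_NORMAL_NC };

    static enum s2_result combine_with_s2_normal_nc(enum s1_type s1)
    {
    	return (s1 == S1_DEVICE) ? RES_DEVICE : RES_NORMAL_NC;
    }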
On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote: > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > trigger any issue on guest device reclaim use cases either (ie device > MMIO unmap followed by a device reset) at least for PCIe devices, in that > in PCIe a device reset is architected and carried out through PCI config > space transactions that are naturally ordered wrt MMIO transactions > according to the PCI ordering rules. This is not how I see that thread concluding. The device reclaim problem belongs solely to VFIO, not KVM. VFIO must ensure global ordering of accesses before the VMA is unmapped and accesses after; that includes ordering whatever mechanism the VFIO driver uses for reset. If there are quirky SOCs, or non-PCI devices, that need something stronger than the TLBI/etc sequence, it should be fixed in VFIO (or maybe even the arch code), not by blocking NORMAL_NC in KVM. Such a quirky SOC would broadly have security issues beyond KVM. > It is worth noting that currently, to map devices MMIO space to user > space in a device pass-through use case the VFIO framework applies memory > attributes derived from pgprot_noncached() settings applied to VMAs, which Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK we should change it to always use pgprot_device. Thanks, Jason
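For context on the helpers contrasted above, a minimal sketch of how a VFIO-like mmap path could pick between them; "wc_requested" is an invented per-region knob, not an existing VFIO interface. On arm64, pgprot_noncached() yields Device-nGnRnE, pgprot_device() Device-nGnRE and pgprot_writecombine() Normal-NC:

    /* Sketch: choose the stage-1 attribute for a BAR mapping. */
    if (wc_requested)
    	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot); /* Normal-NC    */
    else
    	vma->vm_page_prot = pgprot_device(vma->vm_page_prot);       /* Device-nGnRE */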
On Fri, Nov 10, 2023 at 10:26:49AM -0400, Jason Gunthorpe wrote: > On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote: > > > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > > trigger any issue on guest device reclaim use cases either (ie device > > MMIO unmap followed by a device reset) at least for PCIe devices, in that > > in PCIe a device reset is architected and carried out through PCI config > > space transactions that are naturally ordered wrt MMIO transactions > > according to the PCI ordering rules. > > This is not how I see that thread concluding.. > > The device reclaim problem belongs solely to VFIO, not KVM. VFIO must > ensure global ordering of access before the VMA is unmaped and access > after, that includes ordering whatever mechanism the VFIO driver uses > for reset. > > If there are quirky SOCs, or non-PCI devices that need something > stronger than the TLBI/etc sequence it should be fixed in VFIO (or > maybe even the arch code), not by blocking NORMAL_NC in the KVM. Such > a quirky SOC would broadly have security issues beyond KVM. https://lore.kernel.org/linux-arm-kernel/20231013092934.GA13524@willie-the-truck I think that Will's point _was_ related to the change we are making for KVM S2 mappings and related device transactions on device reclaim - ie reset, that's what I tried to convey (I probably simplified too much) that for PCI at least that should not trigger any regression/issue, in that BAR MMIO and reset transactions are decoupled streams and must follow the PCI ordering rules. Yes, it is VFIO responsibility but changing the S2 KVM mappings may have (for non-PCI devices) side effects compared to what we have today, I am not saying this should be a blocker I just summarized the thread above, the paragraph can be expanded. Lorenzo > > > It is worth noting that currently, to map devices MMIO space to user > > space in a device pass-through use case the VFIO framework applies memory > > attributes derived from pgprot_noncached() settings applied to VMAs, which > > Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK > we should change to to always use pgprot_device.. > > Thanks, > Jason
Hi Lorenzo, Thanks for putting this together. It looks fine to me, just some general comments below. On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote: > Updated commit log (it might be [is] too verbose) below, it should probably > be moved into a documentation file but to do that I should decouple > it from this changeset (ie a document explaining memory attributes > and error containment for ARM64 - indipendent from KVM S2 defaults). > > I'd also add a Link: to the lore archive entry for reference (did not > add it in the log below). > > Please let me know what's the best option here. It would be good to have this under Documentation/arch/arm64/, though as a document at the end of the series describing what the new behaviour is, the kernel/KVM expectations (after the patches have been applied). We could even submit it separately once we agreed on how the series looks like. > -- >8 -- > Currently, KVM for ARM64 maps at stage 2 memory that is > considered device (ie it is not RAM) with DEVICE_nGnRE > memory attributes; this setting overrides (as per the ARM > architecture [1]) any device MMIO mapping present at stage > 1, resulting in a set-up whereby a guest operating system > can't determine device MMIO mapping memory attributes on its > own but it is always overriden by the KVM stage 2 default. To be fully correct, stage 1 and stage 2 attributes combine in a way that results in the more restrictive properties, so not a simple override. But to keep things simple, especially if this part only goes in the commit log, leave it as is. > This set-up does not allow guest operating systems to select > device memory attributes on a page by page basis independently Not sure it's even worth specifying "page by page". It just doesn't allow the guest to relax the device map memory attributes at all. > from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage > 2 memory type attributes"), which turns out to be an issue in that > guest operating systems (eg Linux) may request to map > devices MMIO regions with memory attributes that guarantee > better performance (eg gathering attribute - that for some > devices can generate larger PCIe memory writes TLPs) > and specific operations (eg unaligned transactions) such as > the NormalNC memory type. Another case is correct guest behaviour where it would not expect unaligned accesses from Normal NC mappings. > The default device stage 2 mapping was chosen in KVM > for ARM64 since it was considered safer (ie it would > not allow guests to trigger uncontained failures > ultimately crashing the machine) but this turned out > to be imprecise. s/imprecise/asynchronous (SError)/ Another reason was probably that it matches the default VFIO mapping (though the latter is slightly stricter as in Device_nGnRnE). > Failures containability is a property of the platform > and is independent from the memory type used for MMIO > device memory mappings. > > Actually, DEVICE_nGnRE memory type is even more problematic > than eg Normal-NC memory type in terms of faults containability > in that eg aborts triggered on DEVICE_nGnRE loads cannot be made, > architecturally, synchronous (ie that would imply that the processor > should issue at most 1 load transaction at a time - ie it can't pipeline > them - otherwise the synchronous abort semantics would break the > no-speculation attribute attached to DEVICE_XXX memory). 
>
> This means that regardless of the combined stage1+stage2 mappings, a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures, and that in turn relies on platform
> capabilities and the device type being assigned (ie PCIe AER/DPC
> error containment and RAS architecture [3]); therefore the default
> KVM device stage 2 memory attributes play no role in making device
> assignment safer for a given platform (if the platform design
> adheres to the design guidelines outlined in [3]) and therefore can
> be relaxed.
>
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to Normal-NC.

I think this covers the safety aspect w.r.t. uncontained errors.

> A different Normal memory type default at stage-2
> (eg Normal Write-through) could have been chosen
> instead of Normal-NC but Normal-NC was chosen
> because:
>
> - Its attributes are well-understood compared to
>   other Normal memory types for MMIO mappings

Ideally we'd have gone for Device_GRE from a performance perspective
but it doesn't support unaligned accesses. Basically at this stage we
don't want cacheable accesses to MMIO until we gain a better
understanding of what the impact would be.

> - On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible
>   default for normal memory types. For systems implementing
>   FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options
>   to choose the memory types are as follows:
>
>   If S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
>   (XXX determined by S2 MemAttr[1:0]). This is what we have
>   today (MemAttr[2:0] == 0b001) and therefore it is not described
>   any further.
>
>   If S2 MemAttr[2] == 1, there are the following options:
>
>   S2 MemAttr[2:0] | Resulting mapping
>   -----------------------------------------------------------------------------
>   0b101           | Prevents the guest from selecting a cacheable memory type;
>                   | it allows it to choose Device-* or Normal-NC
>   0b110           | Forces the write-back memory type; it breaks MMIO.
>   0b111           | Allows the VM to select any memory type, including cacheable.
>                   | It is unclear whether this is safe from a platform
>                   | perspective, especially wrt uncontained failures and
>                   | cacheability (eg device reclaim/reset and cache
>                   | maintenance).
>   ------------------------------------------------------------------------------
>
> - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
>   identical to FEAT_S2FWB (that basically would force Normal-NC, see
>   option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
>   between the two.

Is this text only to say that we can't use write-through memory because
with FEAT_S2FWB it is forced cacheable? I'd probably keep it simple and
just state that we went for Normal NC to avoid cache
allocation/snooping.

It just crossed my mind: I think we also assume here that on platforms
with transparent caches, Normal NC of MMIO won't be upgraded to
cacheable (that's irrespective of KVM). This may need to be stated in
some Arm BSA document (I'm not aware of any arch rules that prevent
this from happening). AFAIK, platforms with transparent caches only do
this for the memory ranges where the DRAM is expected.
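[Editor's note] The MemAttr[2:0] options tabulated above reduce to a small
decode; the sketch below is only a restatement of that table, the enum and
function names are invented, and the 0b100 encoding is not covered by the
table:

/*
 * Illustration only, not kernel code: decode of the FEAT_S2FWB
 * S2 MemAttr[2:0] options discussed above.
 */
enum s2fwb_policy {
	S2FWB_DEVICE,		/* MemAttr[2] == 0: Device-XXX from MemAttr[1:0] */
	S2FWB_UP_TO_NORMAL_NC,	/* 0b101: guest may choose Device-* or Normal-NC */
	S2FWB_FORCE_WB,		/* 0b110: forced write-back, breaks MMIO */
	S2FWB_GUEST_ANY,	/* 0b111: guest may choose any type, incl. cacheable */
	S2FWB_OTHER,		/* encodings not covered by the table above */
};

static enum s2fwb_policy s2fwb_decode(unsigned int memattr)
{
	if (!(memattr & 0x4))		/* MemAttr[2] == 0 */
		return S2FWB_DEVICE;

	switch (memattr & 0x7) {
	case 0x5: return S2FWB_UP_TO_NORMAL_NC;	/* 0b101 */
	case 0x6: return S2FWB_FORCE_WB;	/* 0b110 */
	case 0x7: return S2FWB_GUEST_ANY;	/* 0b111 */
	default:  return S2FWB_OTHER;		/* 0b100 */
	}
}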
> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (ie device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered wrt MMIO transactions
> according to the PCI ordering rules.

We could state in a doc that for device pass-through there's an
expectation that the device can be reclaimed along the lines of the
PCIe reset behaviour. If an admin decides to allow device pass-through
for some unsafe devices/platforms, it's their problem really. I don't
think a Device_nGnRE mapping would help in those cases anyway.

This could be expanded to Normal Cacheable at some point but KVM would
have to do cache maintenance after unmapping the device from the guest
(so let's leave this for a separate series).

> Having a Normal-NC S2 default puts guests in control (thanks to
> the stage1+stage2 combined memory attributes rules [1]) of device MMIO
> regions' memory mappings, according to the rules described in [1]
> and summarized here ([(S1) - stage1], [(S2) - stage 2]):
>
>  S1           | S2           | Result
>  NORMAL-WB    | NORMAL-NC    | NORMAL-NC
>  NORMAL-WT    | NORMAL-NC    | NORMAL-NC
>  NORMAL-NC    | NORMAL-NC    | NORMAL-NC
>  DEVICE<attr> | NORMAL-NC    | DEVICE<attr>
>
> It is worth noting that currently, to map device MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which
> result in Device-nGnRnE memory attributes for the stage-1 VMM mappings.
>
> This means that a userspace mapping for device MMIO space carried
> out with the current VFIO framework and a guest OS mapping for the same
> MMIO space may result in a mismatched alias as described in [2].
>
> Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
> anything in this respect, in that the mismatched aliases would only affect
> (refer to [2] for a detailed explanation) the ordering between the userspace
> and GuestOS mappings' resulting streams of transactions (ie it does not cause
> loss of property for either stream of transactions on its own), which is
> harmless given that the userspace and GuestOS access to the device is carried
> out through independent transaction streams.

I think this was a recent clarification from the architects (in private
emails). My understanding for some time has been that the mere presence
of a Normal alias would affect the semantics of the user (VMM) accesses
to the Device mapping. Apparently, this only matters if there is some
expected ordering between the user (VMM) access and the VM one with
different aliases. I can't tell where the mismatched aliases rules
state this but I haven't read them in a while.
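[Editor's note] The stage-1/stage-2 combination table quoted above reduces
to a one-line rule; a minimal standalone sketch with made-up type names,
just to make that rule explicit:

/*
 * Illustration only: with stage 2 set to Normal-NC, any Normal stage-1
 * type resolves to Normal-NC, while a Device stage-1 type keeps its
 * Device attributes.
 */
enum s1_mem_type { S1_DEVICE, S1_NORMAL_NC, S1_NORMAL_WT, S1_NORMAL_WB };
enum combined_type { RES_DEVICE, RES_NORMAL_NC };

static enum combined_type combine_with_s2_normal_nc(enum s1_mem_type s1)
{
	return (s1 == S1_DEVICE) ? RES_DEVICE : RES_NORMAL_NC;
}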
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index fde4186cc387..c247e5f29d5a 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -147,6 +147,7 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_NORMAL_NC		0x5
 #define MT_S2_DEVICE_nGnRE	0x1
 
 /*
@@ -154,6 +155,7 @@
  * Stage-2 enforces Normal-WB and Device-nGnRE
  */
 #define MT_S2_FWB_NORMAL	6
+#define MT_S2_FWB_NORMAL_NC	5
 #define MT_S2_FWB_DEVICE_nGnRE	1
 
 #ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ccd291b6893d..a80949002191 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
 				kvm_pte_t *ptep)
 {
 	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
-	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
+	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
			  KVM_S2_MEMATTR(pgt, NORMAL);
	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
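[Editor's note] For reference, a throwaway userspace sketch (illustration
only) mirroring the values in the hunk above against the DDI0487 non-FWB
MemAttr[3:0] encoding, to show why 0x5 is the Normal-NC value and how the
bits decompose; the FWB value 5 is the 0b101 encoding from the
MemAttr[2:0] table earlier in the thread:

/*
 * Bits [3:2] select the outer attribute (0b00 = Device, 0b01 = Normal
 * Non-cacheable, 0b11 = Normal Write-back); bits [1:0] select the inner
 * attribute, or the Device sub-type when the outer bits are 0b00.
 */
#include <assert.h>

#define MT_S2_NORMAL		0xf	/* Outer WB, Inner WB */
#define MT_S2_NORMAL_NC		0x5	/* Outer NC, Inner NC */
#define MT_S2_DEVICE_nGnRE	0x1	/* Device, nGnRE      */

int main(void)
{
	assert(((MT_S2_NORMAL_NC >> 2) & 0x3) == 0x1);		/* outer: Normal NC */
	assert((MT_S2_NORMAL_NC & 0x3) == 0x1);			/* inner: Normal NC */
	assert(((MT_S2_DEVICE_nGnRE >> 2) & 0x3) == 0x0);	/* Device           */
	assert(((MT_S2_NORMAL >> 2) & 0x3) == 0x3);		/* outer: Normal WB */
	return 0;
}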