Message ID | 20231205033015.10044-1-ankita@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2,1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory |
+ Shameer On Tue, 05 Dec 2023 03:30:15 +0000, <ankita@nvidia.com> wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > Currently, KVM for ARM64 maps at stage 2 memory that is considered device > (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting > overrides (as per the ARM architecture [1]) any device MMIO mapping > present at stage 1, resulting in a set-up whereby a guest operating > system cannot determine device MMIO mapping memory attributes on its > own but it is always overridden by the KVM stage 2 default. > > This set-up does not allow guest operating systems to select device > memory attributes independently from KVM stage-2 mappings > (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), > which turns out to be an issue in that guest operating systems > (e.g. Linux) may request to map devices MMIO regions with memory > attributes that guarantee better performance (e.g. gathering > attribute - that for some devices can generate larger PCIe memory > writes TLPs) and specific operations (e.g. unaligned transactions) > such as the NormalNC memory type. > > The default device stage 2 mapping was chosen in KVM for ARM64 since > it was considered safer (i.e. it would not allow guests to trigger > uncontained failures ultimately crashing the machine) but this > turned out to be asynchronous (SError) defeating the purpose. > > Failures containability is a property of the platform and is independent > from the memory type used for MMIO device memory mappings. > > Actually, DEVICE_nGnRE memory type is even more problematic than > Normal-NC memory type in terms of faults containability in that e.g. > aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, > synchronous (i.e. that would imply that the processor should issue at > most 1 load transaction at a time - it cannot pipeline them - otherwise > the synchronous abort semantics would break the no-speculation attribute > attached to DEVICE_XXX memory). > > This means that regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures and that in turn relies on platform capabilities > and the device type being assigned (i.e. PCIe AER/DPC error containment > and RAS architecture[3]); therefore the default KVM device stage 2 > memory attributes play no role in making device assignment safer > for a given platform (if the platform design adheres to design > guidelines outlined in [3]) and therefore can be relaxed. > > For all these reasons, relax the KVM stage 2 device memory attributes > from DEVICE_nGnRE to Normal-NC. Add a new kvm_pgtable_prot flag for > Normal-NC. > > The Normal-NC was chosen over a different Normal memory type default > at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping. > > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > trigger any issue on guest device reclaim use cases either (i.e. device > MMIO unmap followed by a device reset) at least for PCIe devices, in that > in PCIe a device reset is architected and carried out through PCI config > space transactions that are naturally ordered with respect to MMIO > transactions according to the PCI ordering rules. 
> > Having Normal-NC S2 default puts guests in control (thanks to > stage1+stage2 combined memory attributes rules [1]) of device MMIO > regions memory mappings, according to the rules described in [1] > and summarized here ([(S1) - stage1], [(S2) - stage 2]): > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > It is worth noting that currently, to map devices MMIO space to user > space in a device pass-through use case the VFIO framework applies memory > attributes derived from pgprot_noncached() settings applied to VMAs, which > result in device-nGnRnE memory attributes for the stage-1 VMM mappings. > > This means that a userspace mapping for device MMIO space carried > out with the current VFIO framework and a guest OS mapping for the same > MMIO space may result in a mismatched alias as described in [2]. > > Defaulting KVM device stage-2 mappings to Normal-NC attributes does not > change anything in this respect, in that the mismatched aliases would > only affect (refer to [2] for a detailed explanation) ordering between > the userspace and GuestOS mappings resulting stream of transactions > (i.e. it does not cause loss of property for either stream of > transactions on its own), which is harmless given that the userspace > and GuestOS access to the device is carried out through independent > transactions streams. > > [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf > [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf Can you please quote the latest specs? > [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf > > Applied over next-20231201 > > History > ======= > v1 -> v2 > - Updated commit log to the one posted by > Lorenzo Pieralisi <lpieralisi@kernel.org> (Thanks!) > - Added new flag to represent the NORMAL_NC setting. Updated > stage2_set_prot_attr() to handle new flag. > > v1 Link: > https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com/ > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Acked-by: Catalin Marinas <catalin.marinas@arm.com> > Tested-by: Ankit Agrawal <ankita@nvidia.com> Despite the considerable increase in the commit message length, a number of questions are left unanswered: - Shameer reported a regression on non-FWB systems, breaking device assignment: https://lore.kernel.org/all/af13ed63dc9a4f26a6c958ebfa77d78a@huawei.com/ How has this been addressed? - Will had unanswered questions in another part of the thread: https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ Can someone please help concluding it? > > --- > arch/arm64/include/asm/kvm_pgtable.h | 2 ++ > arch/arm64/include/asm/memory.h | 2 ++ > arch/arm64/kvm/hyp/pgtable.c | 11 +++++++++-- > arch/arm64/kvm/mmu.c | 4 ++-- > 4 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h > index cfdf40f734b1..19278dfe7978 100644 > --- a/arch/arm64/include/asm/kvm_pgtable.h > +++ b/arch/arm64/include/asm/kvm_pgtable.h > @@ -197,6 +197,7 @@ enum kvm_pgtable_stage2_flags { > * @KVM_PGTABLE_PROT_W: Write permission. > * @KVM_PGTABLE_PROT_R: Read permission. > * @KVM_PGTABLE_PROT_DEVICE: Device attributes. > + * @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes. > * @KVM_PGTABLE_PROT_SW0: Software bit 0. > * @KVM_PGTABLE_PROT_SW1: Software bit 1. 
> * @KVM_PGTABLE_PROT_SW2: Software bit 2. > @@ -208,6 +209,7 @@ enum kvm_pgtable_prot { > KVM_PGTABLE_PROT_R = BIT(2), > > KVM_PGTABLE_PROT_DEVICE = BIT(3), > + KVM_PGTABLE_PROT_NORMAL_NC = BIT(4), > > KVM_PGTABLE_PROT_SW0 = BIT(55), > KVM_PGTABLE_PROT_SW1 = BIT(56), > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h > index fde4186cc387..c247e5f29d5a 100644 > --- a/arch/arm64/include/asm/memory.h > +++ b/arch/arm64/include/asm/memory.h > @@ -147,6 +147,7 @@ > * Memory types for Stage-2 translation > */ > #define MT_S2_NORMAL 0xf > +#define MT_S2_NORMAL_NC 0x5 > #define MT_S2_DEVICE_nGnRE 0x1 > > /* > @@ -154,6 +155,7 @@ > * Stage-2 enforces Normal-WB and Device-nGnRE > */ > #define MT_S2_FWB_NORMAL 6 > +#define MT_S2_FWB_NORMAL_NC 5 > #define MT_S2_FWB_DEVICE_nGnRE 1 > > #ifdef CONFIG_ARM64_4K_PAGES > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index c651df904fe3..d4835d553c61 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -718,10 +718,17 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p > kvm_pte_t *ptep) > { > bool device = prot & KVM_PGTABLE_PROT_DEVICE; > - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : > - KVM_S2_MEMATTR(pgt, NORMAL); > + bool normal_nc = prot & KVM_PGTABLE_PROT_NORMAL_NC; > + kvm_pte_t attr; > u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; > > + if (device) > + attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE); > + else if (normal_nc) > + attr = KVM_S2_MEMATTR(pgt, NORMAL_NC); > + else > + attr = KVM_S2_MEMATTR(pgt, NORMAL); > + > if (!(prot & KVM_PGTABLE_PROT_X)) > attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN; > else if (device) > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index d14504821b79..1cb302457d3f 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, > struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO }; > struct kvm_s2_mmu *mmu = &kvm->arch.mmu; > struct kvm_pgtable *pgt = mmu->pgt; > - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE | > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC | > KVM_PGTABLE_PROT_R | > (writable ? KVM_PGTABLE_PROT_W : 0); Doesn't this affect the GICv2 VCPU interface, which is effectively a shared peripheral, now allowing a guest to affect another guest's interrupt distribution? If that is the case, this needs to be fixed. In general, I don't think this should be a blanket statement, but be limited to devices that we presume can deal with this (i.e. PCIe, and not much else). > > @@ -1558,7 +1558,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > prot |= KVM_PGTABLE_PROT_X; > > if (device) > - prot |= KVM_PGTABLE_PROT_DEVICE; > + prot |= KVM_PGTABLE_PROT_NORMAL_NC; > else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) > prot |= KVM_PGTABLE_PROT_X; > Thanks, M.
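For context on Marc's GICv2 concern: kvm_phys_addr_ioremap() is not only used for pass-through regions, it also backs KVM-internal MMIO mappings such as the GICv2 VCPU interface page, which has no user VMA behind it, so the hunk above would relax that mapping too. A minimal sketch of the direction suggested here, assuming a hypothetical per-VMA opt-in (the flag name below is illustrative and does not exist in the kernel): keep kvm_phys_addr_ioremap() on Device attributes and relax only user_mem_abort(), and only when the backing VMA opts in.

/*
 * Sketch only; it mirrors the structure of the patch above. VM_MMIO_ALLOW_NC
 * is a hypothetical flag meaning "the driver that created this VMA (e.g.
 * vfio-pci) says Normal-NC is safe for this device".
 */

/* kvm_phys_addr_ioremap(): KVM-internal mappings (GICv2 VCPU page, etc.)
 * keep today's Device-nGnRE attributes, i.e. unchanged. */
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
			     KVM_PGTABLE_PROT_R |
			     (writable ? KVM_PGTABLE_PROT_W : 0);

/* user_mem_abort(): only user-provided MMIO may be relaxed, and only
 * when the VMA carries the opt-in flag. */
if (device) {
	if (vma->vm_flags & VM_MMIO_ALLOW_NC)	/* hypothetical */
		prot |= KVM_PGTABLE_PROT_NORMAL_NC;
	else
		prot |= KVM_PGTABLE_PROT_DEVICE;
}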
+ Lorenzo, he really needs to be on this thread. On Tue, Dec 05, 2023 at 09:21:28AM +0000, Marc Zyngier wrote: > On Tue, 05 Dec 2023 03:30:15 +0000, > <ankita@nvidia.com> wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > > > Currently, KVM for ARM64 maps at stage 2 memory that is considered device > > (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting > > overrides (as per the ARM architecture [1]) any device MMIO mapping > > present at stage 1, resulting in a set-up whereby a guest operating > > system cannot determine device MMIO mapping memory attributes on its > > own but it is always overridden by the KVM stage 2 default. [...] > Despite the considerable increase in the commit message length, a > number of questions are left unanswered: > > - Shameer reported a regression on non-FWB systems, breaking device > assignment: > > https://lore.kernel.org/all/af13ed63dc9a4f26a6c958ebfa77d78a@huawei.com/ This referred to the first patch in the old series which was trying to make the Stage 2 cacheable based on the vma attributes. That patch has been dropped for now. It would be good if Shameer confirms this was the problem. > - Will had unanswered questions in another part of the thread: > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > Can someone please help concluding it? Is this about reclaiming the device? I think we concluded that we can't generalise this beyond PCIe, though not sure there was any formal statement to that thread. The other point Will had was around stating in the commit message why we only relax this to Normal NC. I haven't checked the commit message yet, it needs careful reading ;). > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > index d14504821b79..1cb302457d3f 100644 > > --- a/arch/arm64/kvm/mmu.c > > +++ b/arch/arm64/kvm/mmu.c > > @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, > > struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO }; > > struct kvm_s2_mmu *mmu = &kvm->arch.mmu; > > struct kvm_pgtable *pgt = mmu->pgt; > > - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE | > > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC | > > KVM_PGTABLE_PROT_R | > > (writable ? KVM_PGTABLE_PROT_W : 0); > > Doesn't this affect the GICv2 VCPU interface, which is effectively a > shared peripheral, now allowing a guest to affect another guest's > interrupt distribution? If that is the case, this needs to be fixed. > > In general, I don't think this should be a blanket statement, but be > limited to devices that we presume can deal with this (i.e. PCIe, and > not much else). Based on other on-list and off-line discussions, I came to the same conclusion that we can't relax this beyond PCIe. How we do this, it's up for debate. Some ideas: - something in the vfio-pci driver like a new VM_* flag that KVM can consume (hard sell for an arch-specific thing) - changing the vfio-pci driver to allow WC and NC mappings (x86 terminology) with additional knowledge about what it can safely map as NC. KVM would mimic the vma attributes (it doesn't eliminate the alias though, the guest can always go for Device while the VMM for Normal) - some per-SoC pfn ranges that are permitted as Normal NC. Can the arch code figure out where these PCIe BARs are or is it only the vfio-pci driver? I guess we can sort something out around the PCIe but I'm not familiar with this subsystem. 
The alternative is some "safe" ranges in firmware tables, assuming the firmware configures the PCIe BARs location
On Tue, Dec 05, 2023 at 09:00:15AM +0530, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Currently, KVM for ARM64 maps at stage 2 memory that is considered device > (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting > overrides (as per the ARM architecture [1]) any device MMIO mapping > present at stage 1, resulting in a set-up whereby a guest operating > system cannot determine device MMIO mapping memory attributes on its > own but it is always overridden by the KVM stage 2 default. > > This set-up does not allow guest operating systems to select device > memory attributes independently from KVM stage-2 mappings > (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), > which turns out to be an issue in that guest operating systems > (e.g. Linux) may request to map devices MMIO regions with memory > attributes that guarantee better performance (e.g. gathering > attribute - that for some devices can generate larger PCIe memory > writes TLPs) and specific operations (e.g. unaligned transactions) > such as the NormalNC memory type. > > The default device stage 2 mapping was chosen in KVM for ARM64 since > it was considered safer (i.e. it would not allow guests to trigger > uncontained failures ultimately crashing the machine) but this > turned out to be asynchronous (SError) defeating the purpose. > > Failures containability is a property of the platform and is independent > from the memory type used for MMIO device memory mappings. > > Actually, DEVICE_nGnRE memory type is even more problematic than > Normal-NC memory type in terms of faults containability in that e.g. > aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, > synchronous (i.e. that would imply that the processor should issue at > most 1 load transaction at a time - it cannot pipeline them - otherwise > the synchronous abort semantics would break the no-speculation attribute > attached to DEVICE_XXX memory). > > This means that regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures and that in turn relies on platform capabilities > and the device type being assigned (i.e. PCIe AER/DPC error containment > and RAS architecture[3]); therefore the default KVM device stage 2 > memory attributes play no role in making device assignment safer > for a given platform (if the platform design adheres to design > guidelines outlined in [3]) and therefore can be relaxed. > > For all these reasons, relax the KVM stage 2 device memory attributes > from DEVICE_nGnRE to Normal-NC. Add a new kvm_pgtable_prot flag for > Normal-NC. > > The Normal-NC was chosen over a different Normal memory type default > at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping. > > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > trigger any issue on guest device reclaim use cases either (i.e. device > MMIO unmap followed by a device reset) at least for PCIe devices, in that > in PCIe a device reset is architected and carried out through PCI config > space transactions that are naturally ordered with respect to MMIO > transactions according to the PCI ordering rules. 
> > Having Normal-NC S2 default puts guests in control (thanks to > stage1+stage2 combined memory attributes rules [1]) of device MMIO > regions memory mappings, according to the rules described in [1] > and summarized here ([(S1) - stage1], [(S2) - stage 2]): > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > It is worth noting that currently, to map devices MMIO space to user > space in a device pass-through use case the VFIO framework applies memory > attributes derived from pgprot_noncached() settings applied to VMAs, which > result in device-nGnRnE memory attributes for the stage-1 VMM mappings. > > This means that a userspace mapping for device MMIO space carried > out with the current VFIO framework and a guest OS mapping for the same > MMIO space may result in a mismatched alias as described in [2]. > > Defaulting KVM device stage-2 mappings to Normal-NC attributes does not > change anything in this respect, in that the mismatched aliases would > only affect (refer to [2] for a detailed explanation) ordering between > the userspace and GuestOS mappings resulting stream of transactions > (i.e. it does not cause loss of property for either stream of > transactions on its own), which is harmless given that the userspace > and GuestOS access to the device is carried out through independent > transactions streams. > > [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf > [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf > [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf > > Applied over next-20231201 > > History > ======= > v1 -> v2 > - Updated commit log to the one posted by > Lorenzo Pieralisi <lpieralisi@kernel.org> (Thanks!) > - Added new flag to represent the NORMAL_NC setting. Updated > stage2_set_prot_attr() to handle new flag. > > v1 Link: > https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com/ > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Acked-by: Catalin Marinas <catalin.marinas@arm.com> In the light of having to keep this relaxation only for PCIe devices, I will withdraw my Ack until we come to a conclusion. Thanks.
On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: > > - Will had unanswered questions in another part of the thread: > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > Can someone please help concluding it? > > Is this about reclaiming the device? I think we concluded that we can't > generalise this beyond PCIe, though not sure there was any formal > statement to that thread. The other point Will had was around stating > in the commit message why we only relax this to Normal NC. I haven't > checked the commit message yet, it needs careful reading ;). Not quite, we said reclaiming is VFIO's problem and if VFIO can't reliably reclaim a device it shouldn't create it in the first place. Again, I think a lot of this is trying to take VFIO problems into KVM. VFIO devices should not exist if they pose a harm to the system. If VFIO decided to create the devices anyhow (e.g. admin override or something) then it is not KVM's job to do any further enforcement. Remember, the feedback we got from the CPU architects was that even DEVICE_* will experience an uncontained failure if the device triggers an error response in shipping ARM IP. The reason PCIe is safe is because the PCI bridge does not generate errors in the first place! Thus, the way a platform device can actually be safe is if it too never generates errors in the first place! Obviously this approach works just as well with NORMAL_NC. If a platform device does generate errors then we shouldn't expect containment at all, and the memory type has no bearing on the safety. The correct answer is to block these platform devices from VFIO/KVM/etc because they can trigger uncontained failures. If you have learned something different then please share it. IOW, what is the practical scenario where DEVICE_* has contained errors but NORMAL_NC does not? Jason
[+James] On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: [...] > > - Will had unanswered questions in another part of the thread: > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > Can someone please help concluding it? > > Is this about reclaiming the device? I think we concluded that we can't > generalise this beyond PCIe, though not sure there was any formal > statement to that thread. The other point Will had was around stating > in the commit message why we only relax this to Normal NC. I haven't > checked the commit message yet, it needs careful reading ;). 1) Reclaiming the device: I wrote a section that was reported in this commit log hunk: "Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (i.e. device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered with respect to MMIO transactions according to the PCI ordering rules." It is not too verbose on purpose - I thought it is too complicated to express the details in a commit log but we can elaborate on that if it is beneficial, probably in /Documentation, not in the log itself. 2) On FWB: I added a full section to the log I posted here: https://lore.kernel.org/all/ZUz78gFPgMupew+m@lpieralisi but I think Ankit trimmed it and I can't certainly blame anyone for that, it is a commit log, it is already hard to digest as it is. I added James because I think that most of the points I made in the logs are really RAS related (and without him I would not have understood half of them :)). It would be very beneficial to everyone to have those added to kernel documentation - especially to express in architectural terms why this it is a safe change to make (at least for PCIe devices). I am happy to work on the documentation changes - let's just agree what should be done next. Thanks, Lorenzo > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > > index d14504821b79..1cb302457d3f 100644 > > > --- a/arch/arm64/kvm/mmu.c > > > +++ b/arch/arm64/kvm/mmu.c > > > @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, > > > struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO }; > > > struct kvm_s2_mmu *mmu = &kvm->arch.mmu; > > > struct kvm_pgtable *pgt = mmu->pgt; > > > - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE | > > > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC | > > > KVM_PGTABLE_PROT_R | > > > (writable ? KVM_PGTABLE_PROT_W : 0); > > > > Doesn't this affect the GICv2 VCPU interface, which is effectively a > > shared peripheral, now allowing a guest to affect another guest's > > interrupt distribution? If that is the case, this needs to be fixed. > > > > In general, I don't think this should be a blanket statement, but be > > limited to devices that we presume can deal with this (i.e. PCIe, and > > not much else). > > Based on other on-list and off-line discussions, I came to the same > conclusion that we can't relax this beyond PCIe. How we do this, it's up > for debate. Some ideas: > > - something in the vfio-pci driver like a new VM_* flag that KVM can > consume (hard sell for an arch-specific thing) > > - changing the vfio-pci driver to allow WC and NC mappings (x86 > terminology) with additional knowledge about what it can safely map > as NC. 
KVM would mimic the vma attributes (it doesn't eliminate the > alias though, the guest can always go for Device while the VMM for > Normal) > > - some per-SoC pfn ranges that are permitted as Normal NC. Can the arch > code figure out where these PCIe BARs are or is it only the vfio-pci > driver? I guess we can sort something out around the PCIe but I'm not > familiar with this subsystem. The alternative is some "safe" ranges in > firmware tables, assuming the firmware configures the PCIe BARs > location > > -- > Catalin
> -----Original Message----- > From: Catalin Marinas <catalin.marinas@arm.com> > Sent: Tuesday, December 5, 2023 11:41 AM > Subject: Re: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory [...] > > Despite the considerable increase in the commit message length, a > > number of questions are left unanswered: > > > > - Shameer reported a regression on non-FWB systems, breaking device > > assignment: > > > > https://lore.kernel.org/all/af13ed63dc9a4f26a6c958ebfa77d78a@huawei.com/ > > This referred to the first patch in the old series which was trying to > make the Stage 2 cacheable based on the vma attributes. That patch has > been dropped for now. It would be good if Shameer confirms this was the > problem. Yes, that was related to the first patch. We will soon test this on both FWB and non-FWB platforms and report back. Thanks, Shameer
On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote: > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: > > > - Will had unanswered questions in another part of the thread: > > > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > > > Can someone please help concluding it? > > > > Is this about reclaiming the device? I think we concluded that we can't > > generalise this beyond PCIe, though not sure there was any formal > > statement to that thread. The other point Will had was around stating > > in the commit message why we only relax this to Normal NC. I haven't > > checked the commit message yet, it needs careful reading ;). > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't > reliably reclaim a device it shouldn't create it in the first place. I think that as far as device reclaiming was concerned the question posed was related to memory attributes of transactions for guest mappings and the related grouping/ordering with device reset MMIO transactions - it was not (or wasn't only) about error containment. Thanks, Lorenzo
On Tue, Dec 05, 2023 at 03:37:13PM +0100, Lorenzo Pieralisi wrote: > On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote: > > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: > > > > - Will had unanswered questions in another part of the thread: > > > > > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > > > > > Can someone please help concluding it? > > > > > > Is this about reclaiming the device? I think we concluded that we can't > > > generalise this beyond PCIe, though not sure there was any formal > > > statement to that thread. The other point Will had was around stating > > > in the commit message why we only relax this to Normal NC. I haven't > > > checked the commit message yet, it needs careful reading ;). > > > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't > > reliably reclaim a device it shouldn't create it in the first place. > > I think that as far as device reclaiming was concerned the question > posed was related to memory attributes of transactions for guest > mappings and the related grouping/ordering with device reset MMIO > transactions - it was not (or wasn't only) about error containment. Yes. It is VFIO that issues the reset, it is VFIO that must provide the ordering under the assumption that NORMAL_NC was used. Jason
On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote: > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: > > > - Will had unanswered questions in another part of the thread: > > > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > > > Can someone please help concluding it? > > > > Is this about reclaiming the device? I think we concluded that we can't > > generalise this beyond PCIe, though not sure there was any formal > > statement to that thread. The other point Will had was around stating > > in the commit message why we only relax this to Normal NC. I haven't > > checked the commit message yet, it needs careful reading ;). > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't > reliably reclaim a device it shouldn't create it in the first place. > > Again, I think alot of this is trying to take VFIO problems into KVM. > > VFIO devices should not exist if they pose a harm to the system. If > VFIO decided to create the devices anyhow (eg admin override or > something) then it is not KVM's job to do any further enforcement. Yeah, I made this argument in the past. But it's a fair question to ask since the Arm world is different from x86. Just reusing an existing driver in a different context may break its expectations. Does Normal NC access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is completed? It does reach some point of serialisation with subsequent accesses to the same address but not sure how it is ordered with an access to a different location like the config space used for reset. Maybe it's not a problem at all or it is safe only for PCIe but it would be good to get to the bottom of this. Device-nGnRnE has some stronger rules around end-point completion and that's what the vfio-pci uses. KVM, however, went for the slightly more relaxed nGnRE variant which, at least per the Arm ARM, doesn't have these guarantees. > Remember, the feedback we got from the CPU architects was that even > DEVICE_* will experience an uncontained failure if the device tiggers > an error response in shipping ARM IP. > > The reason PCIe is safe is because the PCI bridge does not generate > errors in the first place! That's an argument to restrict this feature to PCIe. It's really about fewer arguments on the behaviour of other devices. Marc did raise another issue with the GIC VCPU interface (does this even have a vma in the host VMM?). That's a class of devices where the mapping is context-switched, so the TLBI+DSB rules don't help. > Thus, the way a platform device can actually be safe is if it too > never generates errors in the first place! Obviously this approach > works just as well with NORMAL_NC. > > If a platform device does generate errors then we shouldn't expect > containment at all, and the memory type has no bearing on the > safety. The correct answer is to block these platform devices from > VFIO/KVM/etc because they can trigger uncontained failures. Assuming the error containment is sorted, there are two other issues with other types of devices: 1. Ordering guarantees on reclaim or context switch 2. Unaligned accesses On (2), I think PCIe is fairly clear on how the TLPs are generated, so I wouldn't expect additional errors here. But I have no idea what AMBA/AXI does here in general. Perhaps it's fine, I don't think we looked into it as the focus was mostly on PCIe. So, I think it would be easier to get this patch upstream if we limit the change to PCIe devices for now. 
We may relax this further in the future. Do you actually have a need for non-PCIe devices to support WC in the guest or it's more about the complexity of the logic to detect whether it's actually a PCIe BAR we are mapping into the guest? (I can see some Arm GPU folk asking for this but those devices are not easily virtualisable).
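To make the performance motivation behind the patch concrete (the "gathering" and "unaligned transactions" points in the commit message, and the TLP generation Catalin refers to under (2)): the fragment below is illustrative only and not taken from any driver; it simply shows the kind of access pattern that benefits from a Normal-NC guest mapping.

/*
 * Illustrative only. Through a Device-nGnRE (non-Gathering) mapping each
 * store below is emitted as a separate transaction and an unaligned
 * access takes an alignment fault. Through a Normal-NC mapping the CPU
 * may merge the stores into one larger write (e.g. a single PCIe TLP)
 * and unaligned accesses are architecturally permitted.
 */
#include <linux/io.h>
#include <linux/types.h>

static void push_descriptor(void __iomem *bar, const u64 desc[4])
{
	__raw_writeq(desc[0], bar + 0x00);
	__raw_writeq(desc[1], bar + 0x08);
	__raw_writeq(desc[2], bar + 0x10);
	__raw_writeq(desc[3], bar + 0x18);	/* may be gathered into one burst */
}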
On Tue, Dec 05, 2023 at 10:44:17AM -0400, Jason Gunthorpe wrote: > On Tue, Dec 05, 2023 at 03:37:13PM +0100, Lorenzo Pieralisi wrote: > > On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote: > > > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: > > > > > - Will had unanswered questions in another part of the thread: > > > > > > > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > > > > > > > Can someone please help concluding it? > > > > > > > > Is this about reclaiming the device? I think we concluded that we can't > > > > generalise this beyond PCIe, though not sure there was any formal > > > > statement to that thread. The other point Will had was around stating > > > > in the commit message why we only relax this to Normal NC. I haven't > > > > checked the commit message yet, it needs careful reading ;). > > > > > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't > > > reliably reclaim a device it shouldn't create it in the first place. > > > > I think that as far as device reclaiming was concerned the question > > posed was related to memory attributes of transactions for guest > > mappings and the related grouping/ordering with device reset MMIO > > transactions - it was not (or wasn't only) about error containment. > > Yes. It is VFIO that issues the reset, it is VFIO that must provide > the ordering under the assumption that NORMAL_NC was used. And does it? Because VFIO so far only assumes Device-nGnRnE. Do we need to address this first before attempting to change KVM? Sorry, just questions, trying to clear the roadblocks.
On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote: > Yeah, I made this argument in the past. But it's a fair question to ask > since the Arm world is different from x86. Just reusing an existing > driver in a different context may break its expectations. Does Normal NC > access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is > completed? It does reach some point of serialisation with subsequent > accesses to the same address but not sure how it is ordered with an > access to a different location like the config space used for reset. > Maybe it's not a problem at all or it is safe only for PCIe but it would > be good to get to the bottom of this. IMHO, the answer is you can't know architecturally. The specific vfio-platform driver must do an analysis of it's specific SOC and determine what exactly is required to order the reset. The primary purpose of the vfio-platform drivers is to provide this reset! In most cases I would expect some reads from the device to be required before the reset. > > Remember, the feedback we got from the CPU architects was that even > > DEVICE_* will experience an uncontained failure if the device tiggers > > an error response in shipping ARM IP. > > > > The reason PCIe is safe is because the PCI bridge does not generate > > errors in the first place! > > That's an argument to restrict this feature to PCIe. It's really about > fewer arguments on the behaviour of other devices. Marc did raise > another issue with the GIC VCPU interface (does this even have a vma in > the host VMM?). That's a class of devices where the mapping is > context-switched, so the TLBI+DSB rules don't help. I don't know anything about the GIC VCPU interface, to give any comment unfortunately. Since it seems there is something to fix here I would appreciate some background.. When you say it is context switched do you mean kvm does a register write on every vm entry to set the proper HW context for the vCPU? We are worrying that register write will possibly not order after NORMAL_NC? > > Thus, the way a platform device can actually be safe is if it too > > never generates errors in the first place! Obviously this approach > > works just as well with NORMAL_NC. > > > > If a platform device does generate errors then we shouldn't expect > > containment at all, and the memory type has no bearing on the > > safety. The correct answer is to block these platform devices from > > VFIO/KVM/etc because they can trigger uncontained failures. > > Assuming the error containment is sorted, there are two other issues > with other types of devices: > > 1. Ordering guarantees on reclaim or context switch Solved in VFIO > 2. Unaligned accesses > > On (2), I think PCIe is fairly clear on how the TLPs are generated, so I > wouldn't expect additional errors here. But I have no idea what AMBA/AXI > does here in general. Perhaps it's fine, I don't think we looked into it > as the focus was mostly on PCIe. I would expect AXI devices to throw errors in all sorts of odd cases. eg I would not be surprised at all to see carelessly built AXI devices error if ST64B is pointed at them. At least when I was building AXI logic years ago I was so lazy :P This is mostly my point - if the devices under vfio-platform were not designed to have contained errors then, IMHO, it is hard to believe that DEVICE_X/NORMAL_NC is the only issue in that HW. > So, I think it would be easier to get this patch upstream if we limit > the change to PCIe devices for now. We may relax this further in the > future. 
Do you actually have a need for non-PCIe devices to support WC > in the guest or it's more about the complexity of the logic to detect > whether it's actually a PCIe BAR we are mapping into the guest? (I can > see some Arm GPU folk asking for this but those devices are not easily > virtualisable). The complexity is my concern, and the disruption to the ecosystem with some of the ideas given. If there was a trivial way to convey in the VMA that it is safe then sure, no objection from me. My worry is this has turned from fixing a real problem we have today into a debate about theoretical issues that nobody may care about that are very disruptive to solve. I would turn it around and ask we find a way to restrict platform devices when someone comes with a platform device that wants to use secure kvm and has a single well defined HW problem that is solved by this work. What if we change vfio-pci to use pgprot_device() like it already really should and say the pgprot_noncached() is enforced as DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC? Would that be acceptable? Jason
On Tue, 05 Dec 2023 16:43:18 +0000, Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote: > > That's an argument to restrict this feature to PCIe. It's really about > > fewer arguments on the behaviour of other devices. Marc did raise > > another issue with the GIC VCPU interface (does this even have a vma in > > the host VMM?). That's a class of devices where the mapping is > > context-switched, so the TLBI+DSB rules don't help. There is no vma. The CPU interface is entirely under control of KVM. Userspace only provides the IPA for the mapping. > > I don't know anything about the GIC VCPU interface, to give any > comment unfortunately. Since it seems there is something to fix here I > would appreciate some background.. > > When you say it is context switched do you mean kvm does a register > write on every vm entry to set the proper HW context for the vCPU? The CPU interface is mapped in every guest S2 page tables as a per-CPU device, and under complete control of the guest. There is no KVM register write to that frame (unless we're proxying it, but that's for another day). > > We are worrying that register write will possibly not order after > NORMAL_NC? Guest maps the device as Normal-NC (because it now can), which means that there is no control over the alignment or anything like that. The accesses could also be reordered, and/or hit after a context switch to another guest. Which is why KVM has so far used nGnRE as the mapping type. M.
On Tue, Dec 05, 2023 at 04:24:22PM +0000, Catalin Marinas wrote: > On Tue, Dec 05, 2023 at 10:44:17AM -0400, Jason Gunthorpe wrote: > > On Tue, Dec 05, 2023 at 03:37:13PM +0100, Lorenzo Pieralisi wrote: > > > On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote: > > > > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote: > > > > > > - Will had unanswered questions in another part of the thread: > > > > > > > > > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/ > > > > > > > > > > > > Can someone please help concluding it? > > > > > > > > > > Is this about reclaiming the device? I think we concluded that we can't > > > > > generalise this beyond PCIe, though not sure there was any formal > > > > > statement to that thread. The other point Will had was around stating > > > > > in the commit message why we only relax this to Normal NC. I haven't > > > > > checked the commit message yet, it needs careful reading ;). > > > > > > > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't > > > > reliably reclaim a device it shouldn't create it in the first place. > > > > > > I think that as far as device reclaiming was concerned the question > > > posed was related to memory attributes of transactions for guest > > > mappings and the related grouping/ordering with device reset MMIO > > > transactions - it was not (or wasn't only) about error containment. > > > > Yes. It is VFIO that issues the reset, it is VFIO that must provide > > the ordering under the assumption that NORMAL_NC was used. > > And does it? Because VFIO so far only assumes Device-nGnRnE. Do we need > to address this first before attempting to change KVM? Sorry, just > questions, trying to clear the roadblocks. There is no way to know. It is SOC-specific what would be needed. Could people have implemented their platform devices with a multi-path bus architecture for the reset? Yes, definitely. In fact, I've built things like that. Low-speed stuff like reset gets its own low-speed bus. If that was done, will NORMAL_NC vs DEVICE_nGnRnE make a difference? I'm not sure. It depends a lot on how the SOC was designed and how transactions flow on the high-speed side. Posting writes, like PCIe does, would destroy any practical ordering difference between the two memory types. If the writes are not posted then the barriers in the TLBI sequence should order it. Fortunately, if some SOC has this issue we know how to solve it - you must do flushing reads on all the multi-path interconnect segments to serialize everything around the reset. Regardless, getting this wrong is not a functional issue; it causes a subtle time-sensitive security race with VFIO close() that would be hard to actually hit, and would already require privilege to open a VFIO device to exploit. IOW, we don't care. Jason
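To picture the ordering being argued about (the helper names below are made up; this is not existing VFIO code): a reclaim sequence that is safe against lingering Normal-NC writes looks roughly like the sketch below. On PCIe, step 3 works because a non-posted read does not pass earlier posted writes, so by the time its completion returns those writes have reached the device; on other interconnects whether and where such flushing reads are needed is SoC-specific, which is exactly the point being made here.

/*
 * Sketch of the device-reclaim ordering under discussion. All names are
 * illustrative and error handling is omitted.
 */
static void reclaim_device_sketch(struct fake_vfio_dev *vdev)
{
	/* 1. Tear down user/guest mappings of the BARs so no new
	 *    Normal-NC (or Device) accesses can be issued. */
	zap_user_and_stage2_mappings(vdev);

	/* 2. The unmap path performs the stage-2 TLBI + DSB, so no stale
	 *    translation can generate further accesses. */

	/* 3. "Flushing read" through the kernel's own Device-nGnRnE
	 *    mapping: anything still in flight towards the device is
	 *    pushed ahead of the read and has landed once it completes. */
	(void)readl(vdev->bar_kernel_map);

	/* 4. Only now reset the device (config-space FLR, ACPI _RST, ...). */
	reset_device(vdev);
}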
On Tue, Dec 05, 2023 at 05:01:28PM +0000, Marc Zyngier wrote: > On Tue, 05 Dec 2023 16:43:18 +0000, > Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote: > > > That's an argument to restrict this feature to PCIe. It's really about > > > fewer arguments on the behaviour of other devices. Marc did raise > > > another issue with the GIC VCPU interface (does this even have a vma in > > > the host VMM?). That's a class of devices where the mapping is > > > context-switched, so the TLBI+DSB rules don't help. > > There is no vma. The CPU interface is entirely under control of KVM. > Userspace only provides the IPA for the mapping. That's good to know. We can solve the GIC issue by limiting the relaxation to those mappings that have a user vma. Ideally we should do this for vfio only but we don't have an easy way to convey this to KVM.
On Tue, 05 Dec 2023 17:33:01 +0000, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, Dec 05, 2023 at 05:01:28PM +0000, Marc Zyngier wrote: > > On Tue, 05 Dec 2023 16:43:18 +0000, > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote: > > > > That's an argument to restrict this feature to PCIe. It's really about > > > > fewer arguments on the behaviour of other devices. Marc did raise > > > > another issue with the GIC VCPU interface (does this even have a vma in > > > > the host VMM?). That's a class of devices where the mapping is > > > > context-switched, so the TLBI+DSB rules don't help. > > > > There is no vma. The CPU interface is entirely under control of KVM. > > Userspace only provides the IPA for the mapping. > > That's good to know. We can solve the GIC issue by limiting the > relaxation to those mappings that have a user vma. Yes, this should definitely be part of the decision. > Ideally we should do this for vfio only but we don't have an easy > way to convey this to KVM. But if we want to limit this to PCIe, we'll have to find out. The initial proposal (a long while ago) had a flag conveying some information, and I'd definitely feel more confident having something like that. M.
On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote: > On Tue, 05 Dec 2023 17:33:01 +0000, > Catalin Marinas <catalin.marinas@arm.com> wrote: > > Ideally we should do this for vfio only but we don't have an easy > > way to convey this to KVM. > > But if we want to limit this to PCIe, we'll have to find out. The > initial proposal (a long while ago) had a flag conveying some > information, and I'd definitely feel more confident having something > like that. We can add a VM_PCI_IO in the high vma flags to be set by vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM knows this is PCI and relaxes things a bit. It's not generic though if we need this later for something else. A question for Lorenzo: do these BARs appear in iomem_resource? We could search that up instead of a flag, something like the page_is_ram() helper.
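For readers unfamiliar with the high VMA flags mentioned here, a minimal sketch of what the VM_PCI_IO idea could look like; the flag name, bit number and call sites are illustrative and no such flag exists at the time of this thread.

/* include/linux/mm.h -- sketch; real high flags follow the VM_HIGH_ARCH_*
 * convention and the bit number here is made up. */
#ifdef CONFIG_64BIT
#define VM_PCI_IO	BIT(39)		/* VMA maps PCI I/O (BAR) space */
#else
#define VM_PCI_IO	VM_NONE
#endif

/* drivers/vfio/pci/vfio_pci_core.c :: vfio_pci_core_mmap(), sketch:
 * mark the BAR mapping so that KVM can recognise it later. */
vm_flags_set(vma, VM_PCI_IO);

/* KVM's user_mem_abort() would then test this flag before choosing
 * KVM_PGTABLE_PROT_NORMAL_NC over KVM_PGTABLE_PROT_DEVICE, along the
 * lines sketched after Marc's review above. */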
On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote: > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote: > > Yeah, I made this argument in the past. But it's a fair question to ask > > since the Arm world is different from x86. Just reusing an existing > > driver in a different context may break its expectations. Does Normal NC > > access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is > > completed? It does reach some point of serialisation with subsequent > > accesses to the same address but not sure how it is ordered with an > > access to a different location like the config space used for reset. > > Maybe it's not a problem at all or it is safe only for PCIe but it would > > be good to get to the bottom of this. > > IMHO, the answer is you can't know architecturally. The specific > vfio-platform driver must do an analysis of it's specific SOC and > determine what exactly is required to order the reset. The primary > purpose of the vfio-platform drivers is to provide this reset! > > In most cases I would expect some reads from the device to be required > before the reset. I can see in the vfio_platform_common.c code that the reset is either handled by an ACPI _RST method or some custom function in case of DT. Let's consider the ACPI method for now, I assume the AML code pokes some device registers but we can't say much about the ordering it expects without knowing the details. The AML may assume that the ioaddr mapped as Device-nRnRE (ioremap()) in the kernel has the same attributes wherever else is mapped in user or guests. Note that currently the vfio_platform and vfio_pci drivers only allow pgprot_noncached() in user, so they wouldn't worry about other mismatched aliases. I think PCIe is slightly better documented but even here we'll have to rely on the TLBI+DSB to clear any prior writes on different CPUs. It can be argued that it's the responsibility of whoever grants device access to know the details. However, it would help if we give some guidance, any expectations broken if an alias is Normal-NC? It's easier to start with PCIe first until we get some concrete request for other types of devices. > > So, I think it would be easier to get this patch upstream if we limit > > the change to PCIe devices for now. We may relax this further in the > > future. Do you actually have a need for non-PCIe devices to support WC > > in the guest or it's more about the complexity of the logic to detect > > whether it's actually a PCIe BAR we are mapping into the guest? (I can > > see some Arm GPU folk asking for this but those devices are not easily > > virtualisable). > > The complexity is my concern, and the disruption to the ecosystem with > some of the ideas given. > > If there was a trivial way to convey in the VMA that it is safe then > sure, no objection from me. I suggested a new VM_* flag or some way to probe the iomem_resources for PCIe ranges (if they are described in there, not sure). We can invent other tree searching for ranges that get registers from the vfio driver, I don't think it's that difficult. Question is, do we need to do this for other types of devices or it's mostly theoretical at this point (what's theoretical goes both ways really). A more complex way is to change vfio to allow Normal mappings and KVM would mimic them. You were actually planning to do this for Cacheable anyway. 
> I would turn it around and ask we find a way to restrict platform > devices when someone comes with a platform device that wants to use > secure kvm and has a single well defined HW problem that is solved by > this work. We end up with similar search/validation mechanism, so not sure we gain much. > What if we change vfio-pci to use pgprot_device() like it already > really should and say the pgprot_noncached() is enforced as > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC? > Would that be acceptable? pgprot_device() needs to stay as Device, otherwise you'd get speculative reads with potential side-effects.
On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote: > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote: > > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote: > > > Yeah, I made this argument in the past. But it's a fair question to ask > > > since the Arm world is different from x86. Just reusing an existing > > > driver in a different context may break its expectations. Does Normal NC > > > access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is > > > completed? It does reach some point of serialisation with subsequent > > > accesses to the same address but not sure how it is ordered with an > > > access to a different location like the config space used for reset. > > > Maybe it's not a problem at all or it is safe only for PCIe but it would > > > be good to get to the bottom of this. > > > > IMHO, the answer is you can't know architecturally. The specific > > vfio-platform driver must do an analysis of it's specific SOC and > > determine what exactly is required to order the reset. The primary > > purpose of the vfio-platform drivers is to provide this reset! > > > > In most cases I would expect some reads from the device to be required > > before the reset. > > I can see in the vfio_platform_common.c code that the reset is either > handled by an ACPI _RST method or some custom function in case of DT. > Let's consider the ACPI method for now, I assume the AML code pokes some > device registers but we can't say much about the ordering it expects > without knowing the details. The AML may assume that the ioaddr mapped > as Device-nRnRE (ioremap()) in the kernel has the same attributes > wherever else is mapped in user or guests. Note that currently the > vfio_platform and vfio_pci drivers only allow pgprot_noncached() in > user, so they wouldn't worry about other mismatched aliases. AML is OS agnostic. It technically shouldn't assume the OS never used NORMAL_NC to access the device. By the time the AML is invoked the VMAs are all revoked and all TLBs are flushed so I think the mismatched alias problem is gone?? > It can be argued that it's the responsibility of whoever grants device > access to know the details. However, it would help if we give some > guidance, any expectations broken if an alias is Normal-NC? It's easier > to start with PCIe first until we get some concrete request for other > types of devices. The issue was about NORMAL_NC access prior to the TLBI continuing to float in the interconnect and not be ordered strictly before the reset write. Which, IMHO, is basically arguing that DSB doesn't order these operations on some specific SOC - which I view with some doubt. > > The complexity is my concern, and the disruption to the ecosystem with > > some of the ideas given. > > > > If there was a trivial way to convey in the VMA that it is safe then > > sure, no objection from me. > > I suggested a new VM_* flag or some way to probe the iomem_resources for > PCIe ranges (if they are described in there, not sure). We can invent > other tree searching for ranges that get registers from the vfio driver, > I don't think it's that difficult. I do not like iomem_resource probing on principle :( A VM_ flag would be fine, but seems wasteful. > A more complex way is to change vfio to allow Normal mappings and KVM > would mimic them. You were actually planning to do this for Cacheable > anyway. This needs qemu changes, so I very much don't like it. Cachable is always cachable, it is different thing. 
> > I would turn it around and ask we find a way to restrict platform > > devices when someone comes with a platform device that wants to use > > secure kvm and has a single well defined HW problem that is solved by > > this work. > > We end up with similar search/validation mechanism, so not sure we gain > much. We gain by not doing it, ever. I'm expecting nobody will ever bring it up. The vfio-platform drivers all look to be rather old to me, and the manifestation is, at worst, that a hostile VM that didn't crash the machine before does crash it now. I have serious doubts anyone will run into an issue. > > What if we change vfio-pci to use pgprot_device() like it already > > really should and say the pgprot_noncached() is enforced as > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC? > > Would that be acceptable? > > pgprot_device() needs to stay as Device, otherwise you'd get speculative > reads with potential side-effects. I do not mean to change pgprot_device() I mean to detect the difference via pgprot_device() vs pgprot_noncached(). They put a different value in the PTE that we can sense. It is very hacky. Jason
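For reference, on arm64 pgprot_noncached(), pgprot_device() and pgprot_writecombine() differ only in the MAIR attribute index they encode into the PTE (MT_DEVICE_nGnRnE, MT_DEVICE_nGnRE and MT_NORMAL_NC respectively), which is what makes the "sense it from the PTE" idea possible. A sketch of the hack described above, assuming KVM has the VMA at hand (the helper name is made up):

/*
 * arm64-specific sketch: inspect which memory type the VMM's stage-1
 * mapping uses and only relax stage 2 when the driver asked for
 * pgprot_device() rather than pgprot_noncached().
 */
static bool vma_mapped_as_device_gre(struct vm_area_struct *vma)
{
	u64 attridx = pgprot_val(vma->vm_page_prot) & PTE_ATTRINDX_MASK;

	/*
	 * pgprot_noncached() -> MT_DEVICE_nGnRnE: keep stage 2 as Device.
	 * pgprot_device()    -> MT_DEVICE_nGnRE:  Normal-NC may be allowed.
	 */
	return attridx == PTE_ATTRINDX(MT_DEVICE_nGnRE);
}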
> -----Original Message----- > From: Shameerali Kolothum Thodi > Sent: Tuesday, December 5, 2023 2:17 PM > Subject: RE: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory [...] > > This referred to the first patch in the old series which was trying to > > make the Stage 2 cacheable based on the vma attributes. That patch has > > been dropped for now. It would be good if Shameer confirms this was the > > problem. > Yes, that was related to the first patch. We will soon test this on both FWB and non-FWB platforms and report back. Thanks to Yufeng, the test on both FWB and non-FWB platforms with a GPU dev assignment looks fine with this patch. Shameer
On Tue, 05 Dec 2023 18:40:42 +0000, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote: > > On Tue, 05 Dec 2023 17:33:01 +0000, > > Catalin Marinas <catalin.marinas@arm.com> wrote: > > > Ideally we should do this for vfio only but we don't have an easy > > > way to convey this to KVM. > > > > But if we want to limit this to PCIe, we'll have to find out. The > > initial proposal (a long while ago) had a flag conveying some > > information, and I'd definitely feel more confident having something > > like that. > > We can add a VM_PCI_IO in the high vma flags to be set by > vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM > knows this is PCI and relaxes things a bit. It's not generic though if > we need this later for something else. Either that, or something actually describing the attributes that VFIO wants. And I very much want it to be a buy-in behaviour, not something that automagically happens and changes the default behaviour for everyone based on some hand-wavy assertions. If that means a userspace change, fine by me. The VMM better know what is happening. M.
On Tue, Dec 05, 2023 at 06:40:42PM +0000, Catalin Marinas wrote:
> On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote:
> > On Tue, 05 Dec 2023 17:33:01 +0000,
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > Ideally we should do this for vfio only but we don't have an easy
> > > way to convey this to KVM.
> >
> > But if we want to limit this to PCIe, we'll have to find out. The
> > initial proposal (a long while ago) had a flag conveying some
> > information, and I'd definitely feel more confident having something
> > like that.
>
> We can add a VM_PCI_IO in the high vma flags to be set by
> vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM
> knows this is PCI and relaxes things a bit. It's not generic though if
> we need this later for something else.
>
> A question for Lorenzo: do these BARs appear in iomem_resource? We could
> search that up instead of a flag, something like the page_is_ram()
> helper.

They do, but if I am not mistaken an iomem_resource look-up alone would
not tell us whether the range is PCI address space; we would need
additional checks (e.g. detecting whether the address falls within a PCI
host bridge window) if we go down this route.

Lorenzo
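As a point of reference, the page_is_ram()-style lookup Catalin mentions boils down to querying iomem_resource, roughly as in the sketch below (illustrative only). As Lorenzo notes, this only answers "RAM or not", not "is this PCI MMIO", so additional checks would still be needed.

#include <linux/ioport.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * Sketch of an iomem_resource-based test: true if the pfn is System RAM.
 * This is essentially what page_is_ram() does; it cannot tell us whether
 * a non-RAM range sits behind a PCI host bridge window.
 */
static bool pfn_is_system_ram(unsigned long pfn)
{
	return region_intersects(PFN_PHYS(pfn), PAGE_SIZE,
				 IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)
			== REGION_INTERSECTS;
}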
On Wed, Dec 06, 2023 at 11:39:03AM +0000, Marc Zyngier wrote: > On Tue, 05 Dec 2023 18:40:42 +0000, > Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote: > > > On Tue, 05 Dec 2023 17:33:01 +0000, > > > Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > Ideally we should do this for vfio only but we don't have an easy > > > > way to convey this to KVM. > > > > > > But if we want to limit this to PCIe, we'll have to find out. The > > > initial proposal (a long while ago) had a flag conveying some > > > information, and I'd definitely feel more confident having something > > > like that. > > > > We can add a VM_PCI_IO in the high vma flags to be set by > > vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM > > knows this is PCI and relaxes things a bit. It's not generic though if > > we need this later for something else. > > Either that, or something actually describing the attributes that VFIO > wants. > > And I very much want it to be a buy-in behaviour, not something that > automagically happens and changes the default behaviour for everyone > based on some hand-wavy assertions. > > If that means a userspace change, fine by me. The VMM better know what > is happening. Driving the attributes from a single point like the VFIO driver is indeed better. The problem is that write-combining on Arm doesn't come without speculative loads, otherwise we would have solved it by now. I also recall the VFIO maintainer pushing back on relaxing the pgprot_noncached() for the user mapping but I don't remember the reasons. We could do with a pgprot_maybewritecombine() or pgprot_writecombinenospec() (similar to Jason's idea but without changing the semantics of pgprot_device()). For the user mapping on arm64 this would be Device (even _GRE) since it can't disable speculation but stage 2 would leave the decision to the guest since the speculative loads aren't much different from committed loads done wrongly. If we want the VMM to drive this entirely, we could add a new mmap() flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit weird but there is precedent with PROT_MTE to describe a memory type. One question is whether the VFIO driver still needs to have the knowledge and sanitise the requests from the VMM within a single BAR. If there are no security implications to such mappings, the VMM can map parts of the BAR as pgprot_noncached(), other parts as pgprot_writecombine() and KVM just follows them (similarly if we need a cacheable mapping). The latter has some benefits for DPDK but it's a lot more involved with having to add device-specific knowledge into the VMM. The VMM would also have to present the whole BAR contiguously to the guest even if there are different mapping attributes within the range. So a lot of MAP_FIXED uses. I'd rather leaving this decision with the guest than the VMM, it looks like more hassle to create those mappings. The VMM or the VFIO could only state write-combine and speculation allowed.
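Purely to make the interface floated above concrete: a VMM-side opt-in via mmap(), by analogy with the existing PROT_MTE precedent, could look roughly like this. PROT_WRITECOMBINE does not exist today; the name and value are invented for the sketch.

#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical flag, shown only to illustrate the proposal above. */
#define PROT_WRITECOMBINE	0x40

static void *vmm_map_bar_wc(int region_fd, off_t offset, size_t len)
{
	/* VFIO/KVM would remain free to sanitise or downgrade the request. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_WRITECOMBINE,
		    MAP_SHARED, region_fd, offset);
}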
On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote:
> > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> > > What if we change vfio-pci to use pgprot_device() like it already
> > > really should and say the pgprot_noncached() is enforced as
> > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> > > Would that be acceptable?
> >
> > pgprot_device() needs to stay as Device, otherwise you'd get speculative
> > reads with potential side-effects.
>
> I do not mean to change pgprot_device() I mean to detect the
> difference via pgprot_device() vs pgprot_noncached(). They put a
> different value in the PTE that we can sense. It is very hacky.

Ah, ok, it does look hacky though (as is the alternative of coming up
with a new specific pgprot_*() that KVM can treat differently).

BTW, on those Mellanox devices that require different attributes within
a BAR, do they have a problem with speculative reads causing
side-effects? If not, we might as well map the whole BAR in userspace as
Normal NC but have the guest use the appropriate attributes within the
BAR. The VMM wouldn't explicitly access the BAR but we'd get occasional
speculative reads.
On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote: > On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote: > > On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote: > > > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote: > > > > What if we change vfio-pci to use pgprot_device() like it already > > > > really should and say the pgprot_noncached() is enforced as > > > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC? > > > > Would that be acceptable? > > > > > > pgprot_device() needs to stay as Device, otherwise you'd get speculative > > > reads with potential side-effects. > > > > I do not mean to change pgprot_device() I mean to detect the > > difference via pgprot_device() vs pgprot_noncached(). They put a > > different value in the PTE that we can sense. It is very hacky. > > Ah, ok, it does look hacky though (as is the alternative of coming up > with a new specific pgprot_*() that KVM can treat differently). > > BTW, on those Mellanox devices that require different attributes within > a BAR, do they have a problem with speculative reads causing > side-effects? Yes. We definitely have had that problem in the past on older devices. VFIO must map the BAR using pgprot_device/noncached() into the VMM, no other choice is functionally OK. Only some pages can safely tolerate speculative reads and the guest driver knows which they are. Jason
On Wed, Dec 06, 2023 at 12:14:18PM +0000, Catalin Marinas wrote: > We could do with a pgprot_maybewritecombine() or > pgprot_writecombinenospec() (similar to Jason's idea but without > changing the semantics of pgprot_device()). For the user mapping on > arm64 this would be Device (even _GRE) since it can't disable > speculation but stage 2 would leave the decision to the guest since the > speculative loads aren't much different from committed loads done > wrongly. This would be fine, as would a VMA flag. Please pick one :) I think a VMA flag is simpler than messing with pgprot. > If we want the VMM to drive this entirely, we could add a new mmap() > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit As in the other thread, we cannot unconditionally map NORMAL_NC into the VMM. > The latter has some benefits for DPDK but it's a lot more involved > with DPDK WC support will be solved with some VFIO-only change if anyone ever cares to make it, if that is what you mean. > having to add device-specific knowledge into the VMM. The VMM would also > have to present the whole BAR contiguously to the guest even if there > are different mapping attributes within the range. So a lot of MAP_FIXED > uses. I'd rather leaving this decision with the guest than the VMM, it > looks like more hassle to create those mappings. The VMM or the VFIO > could only state write-combine and speculation allowed. We talked about this already, the guest must decide, the VMM doesn't have the information to pre-predict which pages the guest will want to use WC on. Jason
On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote: > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote: > > On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote: > > > On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote: > > > > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote: > > > > > What if we change vfio-pci to use pgprot_device() like it already > > > > > really should and say the pgprot_noncached() is enforced as > > > > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC? > > > > > Would that be acceptable? > > > > > > > > pgprot_device() needs to stay as Device, otherwise you'd get speculative > > > > reads with potential side-effects. > > > > > > I do not mean to change pgprot_device() I mean to detect the > > > difference via pgprot_device() vs pgprot_noncached(). They put a > > > different value in the PTE that we can sense. It is very hacky. > > > > Ah, ok, it does look hacky though (as is the alternative of coming up > > with a new specific pgprot_*() that KVM can treat differently). > > > > BTW, on those Mellanox devices that require different attributes within > > a BAR, do they have a problem with speculative reads causing > > side-effects? > > Yes. We definitely have had that problem in the past on older > devices. VFIO must map the BAR using pgprot_device/noncached() into > the VMM, no other choice is functionally OK. Were those BARs tagged as prefetchable or non-prefetchable ? I assume the latter but please let me know if I am guessing wrong. Thanks, Lorenzo
On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote: > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote: > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote: > > > On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote: > > > > On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote: > > > > > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote: > > > > > > What if we change vfio-pci to use pgprot_device() like it already > > > > > > really should and say the pgprot_noncached() is enforced as > > > > > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC? > > > > > > Would that be acceptable? > > > > > > > > > > pgprot_device() needs to stay as Device, otherwise you'd get speculative > > > > > reads with potential side-effects. > > > > > > > > I do not mean to change pgprot_device() I mean to detect the > > > > difference via pgprot_device() vs pgprot_noncached(). They put a > > > > different value in the PTE that we can sense. It is very hacky. > > > > > > Ah, ok, it does look hacky though (as is the alternative of coming up > > > with a new specific pgprot_*() that KVM can treat differently). > > > > > > BTW, on those Mellanox devices that require different attributes within > > > a BAR, do they have a problem with speculative reads causing > > > side-effects? > > > > Yes. We definitely have had that problem in the past on older > > devices. VFIO must map the BAR using pgprot_device/noncached() into > > the VMM, no other choice is functionally OK. > > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the > latter but please let me know if I am guessing wrong. I don't know it was quite old HW. Probably. Just because a BAR is not marked as prefetchable doesn't mean that the device can't use NORMAL_NC on subsets of it. Jason
On Wed, Dec 06, 2023 at 11:38:09AM -0400, Jason Gunthorpe wrote: > On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote: > > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote: > > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote: > > > > BTW, on those Mellanox devices that require different attributes within > > > > a BAR, do they have a problem with speculative reads causing > > > > side-effects? > > > > > > Yes. We definitely have had that problem in the past on older > > > devices. VFIO must map the BAR using pgprot_device/noncached() into > > > the VMM, no other choice is functionally OK. > > > > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the > > latter but please let me know if I am guessing wrong. > > I don't know it was quite old HW. Probably. > > Just because a BAR is not marked as prefetchable doesn't mean that the > device can't use NORMAL_NC on subsets of it. What about the other way around - would we have a prefetchable BAR that has portions which are unprefetchable?
On Wed, Dec 06, 2023 at 11:16:03AM -0400, Jason Gunthorpe wrote: > On Wed, Dec 06, 2023 at 12:14:18PM +0000, Catalin Marinas wrote: > > We could do with a pgprot_maybewritecombine() or > > pgprot_writecombinenospec() (similar to Jason's idea but without > > changing the semantics of pgprot_device()). For the user mapping on > > arm64 this would be Device (even _GRE) since it can't disable > > speculation but stage 2 would leave the decision to the guest since the > > speculative loads aren't much different from committed loads done > > wrongly. > > This would be fine, as would a VMA flag. Please pick one :) > > I think a VMA flag is simpler than messing with pgprot. I guess one could write a patch and see how it goes ;). > > If we want the VMM to drive this entirely, we could add a new mmap() > > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit > > As in the other thread, we cannot unconditionally map NORMAL_NC into > the VMM. I'm not suggesting this but rather the VMM map portions of the BAR with either Device or Normal-NC, concatenate them (MAP_FIXED) and pass this range as a memory slot (or multiple if a slot doesn't allow multiple vmas). > > The latter has some benefits for DPDK but it's a lot more involved > > with > > DPDK WC support will be solved with some VFIO-only change if anyone > ever cares to make it, if that is what you mean. Yeah. Some arguments I've heard in private and public discussions is that the KVM device pass-through shouldn't be different from the DPDK case. So fixing that would cover KVM as well, though we'd need additional logic in the VMM. BenH had a short talk at Plumbers around this - https://youtu.be/QLvN3KXCn0k?t=7010. There was some statement in there that for x86, the guests are allowed to do WC without other KVM restrictions (not sure whether that's the case, not familiar with it). > > having to add device-specific knowledge into the VMM. The VMM would also > > have to present the whole BAR contiguously to the guest even if there > > are different mapping attributes within the range. So a lot of MAP_FIXED > > uses. I'd rather leaving this decision with the guest than the VMM, it > > looks like more hassle to create those mappings. The VMM or the VFIO > > could only state write-combine and speculation allowed. > > We talked about this already, the guest must decide, the VMM doesn't > have the information to pre-predict which pages the guest will want to > use WC on. Are the Device/Normal offsets within a BAR fixed, documented in e.g. the spec or this is something configurable via some MMIO that the guest does.
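To make the "concatenate them (MAP_FIXED)" idea concrete, the VMM-side plumbing would be the usual reserve-then-overlay pattern sketched below. It assumes, hypothetically, that VFIO could hand out both a Device and a write-combine capable mapping of the same BAR, which is exactly what Jason pushes back on in the reply that follows.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Sketch of the MAP_FIXED concatenation described above: reserve one
 * contiguous VA range for the BAR, then overlay it piecewise. bar_off
 * would come from VFIO_DEVICE_GET_REGION_INFO; the resulting range is
 * handed to KVM as a memslot. Today both overlays would get the same
 * (Device-nGnRnE) attributes from VFIO - expressing the difference,
 * and knowing where to split, is the open question.
 */
static void *vmm_alias_bar(int vfio_fd, off_t bar_off, size_t bar_len)
{
	uint8_t *base = mmap(NULL, bar_len, PROT_NONE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* first half: "Device" user mapping */
	mmap(base, bar_len / 2, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, vfio_fd, bar_off);
	/* second half: hypothetical write-combine capable mapping */
	mmap(base + bar_len / 2, bar_len / 2, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, vfio_fd, bar_off + bar_len / 2);

	return base;
}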
On Wed, Dec 06, 2023 at 04:23:25PM +0000, Catalin Marinas wrote: > On Wed, Dec 06, 2023 at 11:38:09AM -0400, Jason Gunthorpe wrote: > > On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote: > > > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote: > > > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote: > > > > > BTW, on those Mellanox devices that require different attributes within > > > > > a BAR, do they have a problem with speculative reads causing > > > > > side-effects? > > > > > > > > Yes. We definitely have had that problem in the past on older > > > > devices. VFIO must map the BAR using pgprot_device/noncached() into > > > > the VMM, no other choice is functionally OK. > > > > > > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the > > > latter but please let me know if I am guessing wrong. > > > > I don't know it was quite old HW. Probably. > > > > Just because a BAR is not marked as prefetchable doesn't mean that the > > device can't use NORMAL_NC on subsets of it. > > What about the other way around - would we have a prefetchable BAR that > has portions which are unprefetchable? I would say possibly. Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20 years ago. No PCIe system has ever done prefetch. There is a strong incentive to mark BAR's as prefetchable because it allows 64 bit addressing in configurations with bridges. So.. I would expect people have done interesting things here. Jason
On Wed, Dec 06, 2023 at 04:31:48PM +0000, Catalin Marinas wrote: > > This would be fine, as would a VMA flag. Please pick one :) > > > > I think a VMA flag is simpler than messing with pgprot. > > I guess one could write a patch and see how it goes ;). A lot of patches have been sent on this already :( > > > If we want the VMM to drive this entirely, we could add a new mmap() > > > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit > > > > As in the other thread, we cannot unconditionally map NORMAL_NC into > > the VMM. > > I'm not suggesting this but rather the VMM map portions of the BAR with > either Device or Normal-NC, concatenate them (MAP_FIXED) and pass this > range as a memory slot (or multiple if a slot doesn't allow multiple > vmas). The VMM can't know what to do. We already talked about this. The VMM cannot be involved in the decision to make pages NORMAL_NC or not. That idea ignores how actual devices work. Either the VM decides directly as this patch proposes or the VM does some new generic trap/hypercall to ask the VMM to change it on its behalf. The VMM cannot do it independently. AFAIK nobody wants to see a trap/hypercall solution. That is why we have been exclusively focused on this approach. > > > The latter has some benefits for DPDK but it's a lot more involved > > > with > > > > DPDK WC support will be solved with some VFIO-only change if anyone > > ever cares to make it, if that is what you mean. > > Yeah. Some arguments I've heard in private and public discussions is > that the KVM device pass-through shouldn't be different from the DPDK > case. I strongly disagree with this. The KVM case should be solved without the VMM being aware of what mappings the VM is doing. DPDK is in control and can directly ask VFIO to make the correct pgprot with an ioctl. You can hear Alex also articulate this position in that video. > There was some statement in there that for x86, the guests are > allowed to do WC without other KVM restrictions (not sure whether > that's the case, not familiar with it). x86 has a similar issue (Sean was talking about this and how he wants to fix it) where the VMM can restrict things and on x86 there are configurations where WC does and doesn't work in VM's too. Depends on who made the hypervisor. :( Nobody has pushed hard enough to see it resolved in upstream, but I understand some of the cloud operator set have their own solutions. > > We talked about this already, the guest must decide, the VMM doesn't > > have the information to pre-predict which pages the guest will want to > > use WC on. > > Are the Device/Normal offsets within a BAR fixed, documented in e.g. the > spec or this is something configurable via some MMIO that the guest > does. No, it is fully dynamic on demand with firmware RPCs. Jason
On Wed, Dec 06, 2023 at 01:20:35PM -0400, Jason Gunthorpe wrote: > On Wed, Dec 06, 2023 at 04:31:48PM +0000, Catalin Marinas wrote: > > > This would be fine, as would a VMA flag. Please pick one :) > > > > > > I think a VMA flag is simpler than messing with pgprot. > > > > I guess one could write a patch and see how it goes ;). > > A lot of patches have been sent on this already :( But not one with a VM_* flag. I guess we could also add a VM_VFIO flag which implies KVM has less restrictions on the memory type. I think that's more bike-shedding. The key point is that we don't want to relax this for whatever KVM may map in the guest but only for certain devices. Just having a vma may not be sufficient, we can't tell where that vma came from. So for the vfio bits, completely untested: -------------8<---------------------------- diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 1929103ee59a..b89d2dfcd534 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma * See remap_pfn_range(), called from vfio_pci_fault() but we can't * change vm_flags within the fault handler. Set them now. */ - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); vma->vm_ops = &vfio_pci_mmap_ops; return 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..6df46fd7836a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp); # define VM_UFFD_MINOR VM_NONE #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ +#ifdef CONFIG_64BIT +#define VM_VFIO_BIT 39 +#define VM_VFIO BIT(VM_VFIO_BIT) +#else +#define VM_VFIO VM_NONE +#endif + /* Bits set in the VMA until the stack is in its final location */ #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) -------------8<---------------------------- In KVM, Akita's patch would take this into account, not just rely on "device==true". > > > > If we want the VMM to drive this entirely, we could add a new mmap() > > > > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit > > > > > > As in the other thread, we cannot unconditionally map NORMAL_NC into > > > the VMM. > > > > I'm not suggesting this but rather the VMM map portions of the BAR with > > either Device or Normal-NC, concatenate them (MAP_FIXED) and pass this > > range as a memory slot (or multiple if a slot doesn't allow multiple > > vmas). > > The VMM can't know what to do. We already talked about this. The VMM > cannot be involved in the decision to make pages NORMAL_NC or > not. That idea ignores how actual devices work. [...] > > Are the Device/Normal offsets within a BAR fixed, documented in e.g. the > > spec or this is something configurable via some MMIO that the guest > > does. > > No, it is fully dynamic on demand with firmware RPCs. I think that's a key argument. The VMM cannot, on its own, configure the BAR and figure a way to communicate this to the guest. We could invent some para-virtualisation/trapping mechanism but that's unnecessarily complicated. In the DPDK case, DPDK both configures and interacts with the device. In the VMM/VM case, we need the VM to do this, we can't split the configuration in VMM and interaction with the device in the VM.
On Wed, Dec 06, 2023 at 06:58:44PM +0000, Catalin Marinas wrote: > -------------8<---------------------------- > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 1929103ee59a..b89d2dfcd534 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma > * See remap_pfn_range(), called from vfio_pci_fault() but we can't > * change vm_flags within the fault handler. Set them now. > */ > - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); > + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); > vma->vm_ops = &vfio_pci_mmap_ops; > > return 0; > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 418d26608ece..6df46fd7836a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp); > # define VM_UFFD_MINOR VM_NONE > #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ > > +#ifdef CONFIG_64BIT > +#define VM_VFIO_BIT 39 > +#define VM_VFIO BIT(VM_VFIO_BIT) > +#else > +#define VM_VFIO VM_NONE > +#endif > + > /* Bits set in the VMA until the stack is in its final location */ > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) > -------------8<---------------------------- > > In KVM, Akita's patch would take this into account, not just rely on > "device==true". Yes, Ankit let's try this please. I would not call it VM_VFIO though VM_VFIO_ALLOW_WC ? Introduce it in a separate patch and summarize this thread, with a suggested-by from Catalin :) Cc Sean Christopherson <seanjc@google.com> too as x86 kvm might support the idea > I think that's a key argument. The VMM cannot, on its own, configure the > BAR and figure a way to communicate this to the guest. We could invent > some para-virtualisation/trapping mechanism but that's unnecessarily > complicated. In the DPDK case, DPDK both configures and interacts with > the device. In the VMM/VM case, we need the VM to do this, we can't > split the configuration in VMM and interaction with the device in the > VM. Yes Thanks, Jason
On Wed, Dec 06, 2023 at 03:03:56PM -0400, Jason Gunthorpe wrote: > On Wed, Dec 06, 2023 at 06:58:44PM +0000, Catalin Marinas wrote: > > -------------8<---------------------------- > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > > index 1929103ee59a..b89d2dfcd534 100644 > > --- a/drivers/vfio/pci/vfio_pci_core.c > > +++ b/drivers/vfio/pci/vfio_pci_core.c > > @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma > > * See remap_pfn_range(), called from vfio_pci_fault() but we can't > > * change vm_flags within the fault handler. Set them now. > > */ > > - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); > > + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); > > vma->vm_ops = &vfio_pci_mmap_ops; > > > > return 0; > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 418d26608ece..6df46fd7836a 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp); > > # define VM_UFFD_MINOR VM_NONE > > #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ > > > > +#ifdef CONFIG_64BIT > > +#define VM_VFIO_BIT 39 > > +#define VM_VFIO BIT(VM_VFIO_BIT) > > +#else > > +#define VM_VFIO VM_NONE > > +#endif > > + > > /* Bits set in the VMA until the stack is in its final location */ > > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) > > -------------8<---------------------------- > > > > In KVM, Akita's patch would take this into account, not just rely on > > "device==true". > > Yes, Ankit let's try this please. I would not call it VM_VFIO though > > VM_VFIO_ALLOW_WC ? Yeah. I don't really care about the name.
>On Wed, Dec 06, 2023 at 03:03:56PM -0400, Jason Gunthorpe wrote: >> On Wed, Dec 06, 2023 at 06:58:44PM +0000, Catalin Marinas wrote: >> > -------------8<---------------------------- >> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> > index 1929103ee59a..b89d2dfcd534 100644 >> > --- a/drivers/vfio/pci/vfio_pci_core.c >> > +++ b/drivers/vfio/pci/vfio_pci_core.c >> > @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma >> > * See remap_pfn_range(), called from vfio_pci_fault() but we can't >> > * change vm_flags within the fault handler. Set them now. >> > */ >> > - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); >> > + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); >> > vma->vm_ops = &vfio_pci_mmap_ops; >> > >> > return 0; >> > diff --git a/include/linux/mm.h b/include/linux/mm.h >> > index 418d26608ece..6df46fd7836a 100644 >> > --- a/include/linux/mm.h >> > +++ b/include/linux/mm.h >> > @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp); >> > # define VM_UFFD_MINOR VM_NONE >> > #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ >> > >> > +#ifdef CONFIG_64BIT >> > +#define VM_VFIO_BIT 39 >> > +#define VM_VFIO BIT(VM_VFIO_BIT) >> > +#else >> > +#define VM_VFIO VM_NONE >> > +#endif >> > + >> > /* Bits set in the VMA until the stack is in its final location */ >> > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY) >> > -------------8<---------------------------- >> > >> > In KVM, Akita's patch would take this into account, not just rely on >> > "device==true". >> >> Yes, Ankit let's try this please. I would not call it VM_VFIO though >> >> VM_VFIO_ALLOW_WC ? > > Yeah. I don't really care about the name. Thanks, I will make the change to set the VM_VFIO_ALLOW_WC flag in vfio. Then make use of "device==true" and presence of the flag to decide on setting the S2 as NORMAL_NC.
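A minimal sketch of what that combination might look like on the KVM side, using the flag name from Catalin's draft above. The exact placement is an assumption (the real patch has to capture the vma flags before the mmap lock is dropped in user_mem_abort()), and this is not the actual v3 change.

/*
 * Sketch only: pick the stage-2 attribute for a device mapping based on
 * the proposed VM_VFIO_ALLOW_WC vma flag.
 */
static enum kvm_pgtable_prot stage2_device_prot(unsigned long vm_flags)
{
	if (vm_flags & VM_VFIO_ALLOW_WC)
		return KVM_PGTABLE_PROT_NORMAL_NC;	/* guest chooses via stage 1 */

	return KVM_PGTABLE_PROT_DEVICE;			/* current behaviour */
}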
On Wed, Dec 06, 2023 at 12:48:02PM -0400, Jason Gunthorpe wrote: > On Wed, Dec 06, 2023 at 04:23:25PM +0000, Catalin Marinas wrote: > > On Wed, Dec 06, 2023 at 11:38:09AM -0400, Jason Gunthorpe wrote: > > > On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote: > > > > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote: > > > > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote: > > > > > > BTW, on those Mellanox devices that require different attributes within > > > > > > a BAR, do they have a problem with speculative reads causing > > > > > > side-effects? > > > > > > > > > > Yes. We definitely have had that problem in the past on older > > > > > devices. VFIO must map the BAR using pgprot_device/noncached() into > > > > > the VMM, no other choice is functionally OK. > > > > > > > > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the > > > > latter but please let me know if I am guessing wrong. > > > > > > I don't know it was quite old HW. Probably. > > > > > > Just because a BAR is not marked as prefetchable doesn't mean that the > > > device can't use NORMAL_NC on subsets of it. > > > > What about the other way around - would we have a prefetchable BAR that > > has portions which are unprefetchable? > > I would say possibly. > > Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20 > years ago. No PCIe system has ever done prefetch. > > There is a strong incentive to mark BAR's as prefetchable because it > allows 64 bit addressing in configurations with bridges. If by strong incentive you mean the "Additional guidance on the Prefetchable Bit in Memory Space BARs" in the PCI express specifications, I think it has been removed from the spec and the criteria that had to be met to implement it were basically impossible to fulfill on ARM systems, it did not make any sense in the first place. I agree on your statement related to the prefetchable concept but I believe that a prefetchable BAR containing regions that have read side-effects is essentially a borked design unless at system level speculative reads are prevented (as far as I understand the implementation note this could only be an endpoint integrated in a system where read speculation can't just happen (?)). Long story short: a PCIe card/device that can be plugged on any PCIe compliant system (x86, ARM or whatever) should not mark a BAR region with memory read side-effects as prefetchable, either that or I don't understand what the implementation note above was all about. AFAIK the prefetchable concept in PCIe is likely to be scrapped altogether in the not too distant future. Anyway, that was just for completeness (and to shed light on the BAR prefetchable bit usage). Lorenzo
On Thu, Dec 07, 2023 at 11:13:52AM +0100, Lorenzo Pieralisi wrote: > > > What about the other way around - would we have a prefetchable BAR that > > > has portions which are unprefetchable? > > > > I would say possibly. > > > > Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20 > > years ago. No PCIe system has ever done prefetch. > > > > There is a strong incentive to mark BAR's as prefetchable because it > > allows 64 bit addressing in configurations with bridges. > > If by strong incentive you mean the "Additional guidance on the > Prefetchable Bit in Memory Space BARs" in the PCI express specifications, > I think it has been removed from the spec and the criteria that had to be > met to implement it were basically impossible to fulfill on ARM systems, > it did not make any sense in the first place. No, I mean many systems don't have room to accommodate large 32 bit BARs and the only real way to make stuff work is to have a 64 bit BAR by setting prefetchable. Given mis-marking a read-side-effect region as prefetchable has no actual consequence on PCI-E I would not be surprised to learn people have done this. > I agree on your statement related to the prefetchable concept but I > believe that a prefetchable BAR containing regions that have > read side-effects is essentially a borked design unless at system level > speculative reads are prevented (as far as I understand the > implementation note this could only be an endpoint integrated in a > system where read speculation can't just happen (?)). IMHO the PCIe concept of prefetchable has no intersection with the CPU. The CPU chooses entirely on its own what rules to apply to the PCI MMIO regions and no OS should drive that decision based on the prefetchable BAR flag. The *only* purpose of the prefetchable flag was to permit a classical 33/66MHz PCI bridge to prefetch reads because the physical bus protocol back then did not include a read length. For any system that does not have an ancient PCI bridge the indication is totally useless. Jason
On Thu, Dec 07, 2023 at 09:38:25AM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 07, 2023 at 11:13:52AM +0100, Lorenzo Pieralisi wrote:
> > > > What about the other way around - would we have a prefetchable BAR that
> > > > has portions which are unprefetchable?
> > >
> > > I would say possibly.
> > >
> > > Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20
> > > years ago. No PCIe system has ever done prefetch.
> > >
> > > There is a strong incentive to mark BAR's as prefetchable because it
> > > allows 64 bit addressing in configurations with bridges.
> >
> > If by strong incentive you mean the "Additional guidance on the
> > Prefetchable Bit in Memory Space BARs" in the PCI express specifications,
> > I think it has been removed from the spec and the criteria that had to be
> > met to implement it were basically impossible to fulfill on ARM systems,
> > it did not make any sense in the first place.
>
> No, I mean many systems don't have room to accommodate large 32 bit
> BARs and the only real way to make stuff work is to have a 64 bit BAR
> by setting prefetchable.

That's what the implementation note I mentioned referred to ;)

> Given mis-marking a read-side-effect region as prefetchable has no
> actual consequence on PCI-E I would not be surprised to learn people
> have done this.

PCIe spec 6.1, section 7.5.1.2.1 "Base Address Registers":

"A function is permitted to mark a range as prefetchable if there are no
side effects on reads..."

I don't think an OS should use the prefetchable flag to infer anything
(even though we do at the moment for the sysfs mappings; I know the
prefetchable concept is being scrapped from the PCI specs altogether for
a reason). However, given what I quoted above, I don't see how we can
call such mis-marking a SW bug at present.

I'd agree that it is best not to use that flag for new code we are
adding (because it will be deprecated soon).

Thanks,
Lorenzo
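For reference, the bit being discussed is what the kernel exposes through pci_resource_flags(); the conclusion above is that it must not be used to choose between Device and Normal-NC mappings. A trivial illustration (not code from this series):

#include <linux/ioport.h>
#include <linux/pci.h>

/*
 * Whether a BAR advertises itself as prefetchable. Per the discussion
 * above, on PCIe this says nothing reliable about read side-effects,
 * so it should not drive CPU or stage-2 memory attribute decisions.
 */
static bool bar_is_prefetchable(struct pci_dev *pdev, int bar)
{
	return !!(pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH);
}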
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index cfdf40f734b1..19278dfe7978 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -197,6 +197,7 @@ enum kvm_pgtable_stage2_flags { * @KVM_PGTABLE_PROT_W: Write permission. * @KVM_PGTABLE_PROT_R: Read permission. * @KVM_PGTABLE_PROT_DEVICE: Device attributes. + * @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes. * @KVM_PGTABLE_PROT_SW0: Software bit 0. * @KVM_PGTABLE_PROT_SW1: Software bit 1. * @KVM_PGTABLE_PROT_SW2: Software bit 2. @@ -208,6 +209,7 @@ enum kvm_pgtable_prot { KVM_PGTABLE_PROT_R = BIT(2), KVM_PGTABLE_PROT_DEVICE = BIT(3), + KVM_PGTABLE_PROT_NORMAL_NC = BIT(4), KVM_PGTABLE_PROT_SW0 = BIT(55), KVM_PGTABLE_PROT_SW1 = BIT(56), diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index fde4186cc387..c247e5f29d5a 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -147,6 +147,7 @@ * Memory types for Stage-2 translation */ #define MT_S2_NORMAL 0xf +#define MT_S2_NORMAL_NC 0x5 #define MT_S2_DEVICE_nGnRE 0x1 /* @@ -154,6 +155,7 @@ * Stage-2 enforces Normal-WB and Device-nGnRE */ #define MT_S2_FWB_NORMAL 6 +#define MT_S2_FWB_NORMAL_NC 5 #define MT_S2_FWB_DEVICE_nGnRE 1 #ifdef CONFIG_ARM64_4K_PAGES diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index c651df904fe3..d4835d553c61 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -718,10 +718,17 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p kvm_pte_t *ptep) { bool device = prot & KVM_PGTABLE_PROT_DEVICE; - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : - KVM_S2_MEMATTR(pgt, NORMAL); + bool normal_nc = prot & KVM_PGTABLE_PROT_NORMAL_NC; + kvm_pte_t attr; u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; + if (device) + attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE); + else if (normal_nc) + attr = KVM_S2_MEMATTR(pgt, NORMAL_NC); + else + attr = KVM_S2_MEMATTR(pgt, NORMAL); + if (!(prot & KVM_PGTABLE_PROT_X)) attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN; else if (device) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index d14504821b79..1cb302457d3f 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO }; struct kvm_s2_mmu *mmu = &kvm->arch.mmu; struct kvm_pgtable *pgt = mmu->pgt; - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE | + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC | KVM_PGTABLE_PROT_R | (writable ? KVM_PGTABLE_PROT_W : 0); @@ -1558,7 +1558,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, prot |= KVM_PGTABLE_PROT_X; if (device) - prot |= KVM_PGTABLE_PROT_DEVICE; + prot |= KVM_PGTABLE_PROT_NORMAL_NC; else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) prot |= KVM_PGTABLE_PROT_X;
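Not part of the patch, but to show the end-to-end effect: once stage 2 is Normal-NC, a guest driver that asks for a write-combining stage-1 mapping of a BAR region it knows is safe actually gets Normal-NC for that region; with DEVICE_nGnRE forced at stage 2 it was silently degraded to Device. Illustrative guest-side code using standard helpers only:

#include <linux/pci.h>

/*
 * Guest-side mapping of a BAR the driver knows tolerates gathering,
 * reordering and speculative reads. With the stage-2 relaxation the
 * combined stage1+stage2 attribute follows this choice.
 */
static void __iomem *map_wc_region(struct pci_dev *pdev, int bar)
{
	return pci_iomap_wc(pdev, bar, 0);	/* Normal-NC at stage 1 */
}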