Message ID: 20231214103520.7198-1-yan.y.zhao@intel.com (mailing list archive)
State: New, archived
Series: [RFC] KVM: Introduce KVM VIRTIO device
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Thursday, December 14, 2023 6:35 PM
>
> - For host non-MMIO pages,
>   * virtio guest frontend and host backend driver should be synced to use
>     the same memory type to map a buffer. Otherwise, there will be
>     potential problem for incorrect memory data. But this will only impact
>     the buggy guest alone.
>   * for live migration,
>     as QEMU will read all guest memory during live migration, page aliasing
>     could happen.
>     Current thinking is to disable live migration if a virtio device has
>     indicated its noncoherent state.
>     As a follow-up, we can discuss other solutions. e.g.
>     (a) switching back to coherent path before starting live migration.

both guest/host switching to coherent or host-only?

host-only certainly is problematic if the guest is still using non-coherent.

on the other hand I'm not sure whether the host/guest gfx stack is capable of
switching between the coherent and non-coherent paths on the fly while the
buffer is being rendered.

>     (b) read/write of guest memory with clflush during live migration.

write is irrelevant as it's only done in the resume path where the guest is
not running.

> Implementation Consideration
> ===
> There is a previous series [1] from google to serve the same purpose to
> let KVM be aware of virtio GPU's noncoherent DMA status. That series
> requires a new memslot flag, and special memslots in user space.
>
> We don't choose to use memslot flag to request honoring guest memory
> type.

memslot flag has the potential to restrict the impact e.g. when using
clflush-before-read in migration? Of course the implication is to honor guest
type only for the selected slot in KVM instead of applying it to the entire
guest memory as in the previous series (which selects this way because
vmx_get_mt_mask() is in a perf-critical path, hence not good to check a
memslot flag?)

> Instead we hope to make the honoring request to be explicit (not tied to a
> memslot flag). This is because once guest memory type is honored, not only
> memory used by guest virtio device, but all guest memory is facing page
> aliasing issue potentially. KVM needs a generic solution to take care of
> page aliasing issue rather than counting on memory type of a special
> memslot being aligned in host and guest.
> (we can discuss what a generic solution to handle page aliasing issue will
> look like in later follow-up series).
>
> On the other hand, we choose to introduce a KVM virtio device rather than
> just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma()
> directly, which is based on considerations that

I wonder if it's over-engineered for the purpose.

why not just introduce a KVM_CAP and allow the VMM to enable it? KVM doesn't
need to know the exact source of the requirement...
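(As a concrete illustration of option (b) above, flushing CPU caches before
the VMM reads guest memory during migration, here is a minimal userspace
sketch. The helper name and its call site in the migration loop are
assumptions; _mm_clflush() and _mm_sfence() are standard x86 intrinsics.)

/*
 * Sketch only: flush the CPU caches covering a guest RAM range before the
 * VMM reads it, so the reads observe data the guest wrote through a
 * non-coherent (e.g. WC) mapping.
 */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache-line size */

static void flush_guest_range(void *hva, size_t len)
{
	uintptr_t p = (uintptr_t)hva & ~(uintptr_t)(CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)hva + len;

	for (; p < end; p += CACHE_LINE)
		_mm_clflush((void *)p);	/* write back + invalidate one line */
	_mm_sfence();			/* order flushes before later reads */
}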
On Fri, Dec 15, 2023 at 02:23:48PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Thursday, December 14, 2023 6:35 PM
> >
[...]
> >     (a) switching back to coherent path before starting live migration.
>
> both guest/host switching to coherent or host-only?
>
> host-only certainly is problematic if the guest is still using non-coherent.

Both.

> on the other hand I'm not sure whether the host/guest gfx stack is capable of
> switching between the coherent and non-coherent paths on the fly while the
> buffer is being rendered.

Yes, I'm also not sure about it. But it's an option, though.

> >     (b) read/write of guest memory with clflush during live migration.
>
> write is irrelevant as it's only done in the resume path where the guest is
> not running.

Given that the host write is with PAT WB and the hardware is in no-snoop mode,
is it better to perform a cache flush after the host write? (We can do more
investigation to check whether it's necessary.)
BTW, there's also post-copy live migration, in which case the guest is
running :)

[...]
> memslot flag has the potential to restrict the impact e.g. when using
> clflush-before-read in migration? Of course the implication is to honor guest
> type only for the selected slot in KVM instead of applying it to the entire
> guest memory as in the previous series (which selects this way because
> vmx_get_mt_mask() is in a perf-critical path, hence not good to check a
> memslot flag?)

I think checking a memslot flag is in itself all right. But a memslot flag
does not carry the memory type that the host is using for the memslot.
On the other hand, virtio GPU is not the only source of non-coherent DMAs.
The memslot flag approach is not applicable to pass-through GPUs, due to the
lack of coordination between guest and host.

[...]
> I wonder if it's over-engineered for the purpose.
>
> why not just introduce a KVM_CAP and allow the VMM to enable it?

As we hope to increase the non-coherent DMA count on hot-plug of a
non-coherent device and decrease the count on hot-unplug of that device, a
KVM_CAP looks like it would require the user to maintain a ref count before
turning it on/off, which is less desirable. Agree?

> KVM doesn't need to know the exact source of the requirement...

Maybe we can use the source info in a way like this:
1. indicate that the source is not a passthrough device
2. record the relationship between GPA and memory type.

Then, if KVM knows the non-coherent DMAs do not involve any passthrough
devices, it can force a GPA's memory type (by ignoring guest PAT) to the one
specified by the host (in 2), so as to avoid cache flush operations before
live migration.
If passthrough devices get involved later, we can zap the EPT and rebuild the
memory types to honor guest PAT, resorting to cache flushes before live
migration to maintain coherency.
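(A minimal sketch of the two-mode logic Yan describes; every helper name below
is invented for illustration, and none of this is existing KVM code.)

/*
 * Hypothetical helpers only: "force the host-specified type when all
 * noncoherent-DMA sources are paravirt" vs. "honor guest PAT once a
 * passthrough device is involved".
 */
static u8 effective_memtype(struct kvm *kvm, gfn_t gfn)
{
	if (!kvm_has_passthrough_noncoherent_dma(kvm))	/* invented */
		return host_recorded_memtype(kvm, gfn);	/* from step 2 */

	return guest_pat_memtype(kvm, gfn);		/* honor guest PAT */
}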
+Yiwei

On Fri, Dec 15, 2023, Kevin Tian wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Thursday, December 14, 2023 6:35 PM
> >
[...]
> > We don't choose to use memslot flag to request honoring guest memory
> > type.
>
> memslot flag has the potential to restrict the impact e.g. when using
> clflush-before-read in migration?

Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back
to the host kernel, then the memslot flag will allow for a much more targeted
operation.

> Of course the implication is to honor guest type only for the selected slot
> in KVM instead of applying it to the entire guest memory as in the previous
> series (which selects this way because vmx_get_mt_mask() is in a
> perf-critical path, hence not good to check a memslot flag?)

Checking a memslot flag won't impact performance. KVM already has the memslot
when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(),
has access to the memslot.

That isn't coincidental, KVM _must_ have the memslot to construct the SPTE,
e.g. to retrieve the associated PFN, update write-tracking for shadow pages,
etc.

I added Yiwei, who I think is planning on posting another RFC for the memslot
idea (I actually completely forgot that the memslot idea had been thought of
and posted a few years back).

[...]
> I wonder if it's over-engineered for the purpose.
>
> why not just introduce a KVM_CAP and allow the VMM to enable it?
> KVM doesn't need to know the exact source of the requirement...

Agreed. If we end up needing to grant the whole VM access for some reason,
just give userspace a direct toggle.
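(For comparison, the KVM_CAP route would be only a few lines in the per-VM
KVM_ENABLE_CAP path. KVM_CAP_NONCOHERENT_DMA is an invented name, and as Yan
notes earlier, userspace would then own any ref-counting across device
hot-plug/unplug.)

/*
 * Hypothetical per-VM KVM_ENABLE_CAP handling. The cap name is invented;
 * kvm_arch_register_noncoherent_dma() exists on x86 and is refcounted, so
 * unbalanced enables/disables would be userspace's problem here.
 */
static int enable_noncoherent_dma_cap(struct kvm *kvm,
				      struct kvm_enable_cap *cap)
{
	if (cap->flags)
		return -EINVAL;

	if (cap->args[0])
		kvm_arch_register_noncoherent_dma(kvm);
	else
		kvm_arch_unregister_noncoherent_dma(kvm);

	return 0;
}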
> +Yiwei
>
> On Fri, Dec 15, 2023, Kevin Tian wrote:
[...]
> > memslot flag has the potential to restrict the impact e.g. when using
> > clflush-before-read in migration?
>
> Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back
> to the host kernel, then the memslot flag will allow for a much more targeted
> operation.
>
[...]
> I added Yiwei, who I think is planning on posting another RFC for the memslot
> idea (I actually completely forgot that the memslot idea had been thought of
> and posted a few years back).

We've deferred to Yan (Intel side) to drive the userspace opt-in. So it's up
to Yan to revise the series to be memslot-flag based. I'm okay with whatever
upstream folks think is safer for the opt-in. Thanks!

[...]
> Agreed. If we end up needing to grant the whole VM access for some reason,
> just give userspace a direct toggle.
On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
[...]
> Checking a memslot flag won't impact performance. KVM already has the memslot
> when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(),
> has access to the memslot.
>
> That isn't coincidental, KVM _must_ have the memslot to construct the SPTE,
> e.g. to retrieve the associated PFN, update write-tracking for shadow pages,
> etc.

Hi Sean,
Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
For KVM_MEM_DMA, KVM needs to
(a) search the VMA for vma->vm_page_prot and convert it to a page cache mode
    (with pgprot2cachemode()?), or
(b) look up the memtype of the PFN by calling lookup_memtype(), similar to
    what pat_pfn_immune_to_uc_mtrr() does.

But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.

For KVM_MEM_WC, it requires the user to ensure the memory is actually mapped
WC, right?

Then vmx_get_mt_mask() just ignores guest PAT and programs host PAT as the EPT
type for the special memslot only, as below.
Is this understanding correct?

static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	if (gfn_in_dma_slot(vcpu->kvm, gfn)) {
		u8 type = MTRR_TYPE_WRCOMB;
		//u8 type = pat_pfn_memtype(pfn);
		return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
	}

	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
		else
			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
				VMX_EPT_IPAT_BIT;
	}

	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
}

BTW, since the special memslot must be exposed to the guest as a virtio GPU
BAR in order to prevent other guest drivers from accessing it, I wonder if
it's better to include some keyword like VIRTIO_GPU_BAR in the memslot flag
name.
On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
[...]
> Hi Sean,
> Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
[...]
> Then vmx_get_mt_mask() just ignores guest PAT and programs host PAT as the EPT
> type for the special memslot only, as below.
> Is this understanding correct?
[...]
> BTW, since the special memslot must be exposed to the guest as a virtio GPU
> BAR in order to prevent other guest drivers from accessing it, I wonder if
> it's better to include some keyword like VIRTIO_GPU_BAR in the memslot flag
> name.

Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, where the
user (e.g. QEMU) does special treatment for this kind of memslot (e.g.
skipping reading/writing to it in general paths).

@@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	if (is_mmio)
 		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+	if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
+		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
+
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
 		else
 			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
 				VMX_EPT_IPAT_BIT;
 	}
 
 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
 }
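(On the userspace side, opting a slot in via such a flag would be part of
normal memslot registration. KVM_MEM_HONOR_GUEST_PAT below is the proposed,
not yet existing, flag; everything else is the stock
KVM_SET_USER_MEMORY_REGION API.)

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Sketch: register a virtio GPU BAR slot with the hypothetical flag set. */
static int register_gpu_bar_slot(int vm_fd, uint32_t slot_id, uint64_t gpa,
				 uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot_id,
		.flags           = KVM_MEM_HONOR_GUEST_PAT,	/* invented */
		.guest_phys_addr = gpa,				/* virtio GPU BAR */
		.memory_size     = size,
		.userspace_addr  = (uint64_t)hva,		/* e.g. WC mapping */
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}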
On Wed, Dec 20, 2023, Yan Zhao wrote:
> On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
[...]
> Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, where the
> user (e.g. QEMU) does special treatment for this kind of memslot (e.g.
> skipping reading/writing to it in general paths).
>
> @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  	if (is_mmio)
>  		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>
> +	if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> +		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

This is more along the lines of what I was thinking, though the name should be
something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86-specific and not
contradictory for AMD (which already honors guest PAT).

I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
those out. This is a new feature, so we have the luxury of defining KVM's ABI
for that feature, i.e. we can state that on x86 it honors guest PAT, but not
MTRRs.

Like so?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..ed527acb2bd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+			  struct kvm_memory_slot *slot)
 {
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	if (is_mmio)
 		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+	if (kvm_memslot_has_non_coherent_dma(slot))
+		return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

I like the idea of pulling the memtype from the host, but if we can make that
work then I don't see the need for a special memslot flag, i.e. just do it for
*all* SPTEs on VMX. I don't think we need a VMA for that, e.g. we should be
able to get the memtype from the host PTEs, just like we do the page size.

KVM_MEM_WC is a hard "no" for me. It's far too x86-centric, and as you alluded
to, it requires coordination from the guest, i.e. it is effectively limited to
paravirt scenarios.

> +
>  	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
[...]
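(To make "get the memtype from the host PTEs" concrete: on x86, a 4K PTE
selects one of the eight IA32_PAT MSR entries through its PWT (bit 3), PCD
(bit 4) and PAT (bit 7) bits. A sketch of just the decode step follows; the
helper name is an assumption.)

/*
 * Sketch only: compute the PAT index encoded in a host 4K PTE. The IA32_PAT
 * MSR entry at this index gives the memory type (WB/WC/UC/...), which KVM
 * would then translate into an EPT memtype.
 */
static u8 host_pte_pat_index(u64 pte)
{
	return  ((pte >> 3) & 1) |		/* PWT */
	       (((pte >> 4) & 1) << 1) |	/* PCD */
	       (((pte >> 7) & 1) << 2);		/* PAT (4K pages) */
}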
On Wed, Dec 20, 2023 at 06:12:55PM -0800, Sean Christopherson wrote:
> On Wed, Dec 20, 2023, Yan Zhao wrote:
[...]
> > @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  	if (is_mmio)
> >  		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> >
> > +	if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> > +		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>
> This is more along the lines of what I was thinking, though the name should be
> something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86-specific and not
> contradictory for AMD (which already honors guest PAT).
>
> I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
> those out. This is a new feature, so we have the luxury of defining KVM's ABI
> for that feature, i.e. we can state that on x86 it honors guest PAT, but not
> MTRRs.
>
> Like so?

Yes, this looks good to me. Will refine the patch in the recommended way.

Only one remaining question from me is whether we could allow the user to set
the flag on every slot. If the user is allowed to set the flag on a system RAM
slot, it may cause problems, since guest drivers (other than the paravirt one)
may get memory allocated from that RAM slot and have their PAT settings take
effect, which is not desired.
Do we need to find a way to limit the eligible slots? e.g. check the VMAs of
the slot to ensure its vm_flags include VM_IO and VM_PFNMAP (see the sketch
below). Or just trust the user?

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
[...]
> @@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  	if (is_mmio)
>  		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>
> +	if (kvm_memslot_has_non_coherent_dma(slot))
> +		return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> +
>  	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> I like the idea of pulling the memtype from the host, but if we can make that
> work then I don't see the need for a special memslot flag, i.e. just do it for
> *all* SPTEs on VMX. I don't think we need a VMA for that, e.g. we should be
> able to get the memtype from the host PTEs, just like we do the page size.

Right. Besides, after reading the host PTEs, we still need to decode the bits
into a memtype in a platform-specific way and convert that type to an EPT
type.

> KVM_MEM_WC is a hard "no" for me. It's far too x86-centric, and as you alluded
> to, it requires coordination from the guest, i.e. it is effectively limited to
> paravirt scenarios.

Got it!

[...]
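(For the VM_IO | VM_PFNMAP restriction raised above, a sketch of what the
check could look like when the flag is set on a memslot; the helper itself is
hypothetical, while the VMA walk uses existing kernel APIs.)

/*
 * Hypothetical validation at memslot-flag set time: require every VMA
 * backing the slot's userspace range to be a VM_IO | VM_PFNMAP mapping.
 */
static bool slot_range_is_pfnmap(struct kvm_memory_slot *slot)
{
	unsigned long addr = slot->userspace_addr;
	unsigned long end = addr + (slot->npages << PAGE_SHIFT);
	struct vm_area_struct *vma;
	bool ret = true;

	mmap_read_lock(current->mm);
	for (; addr < end; addr = vma->vm_end) {
		vma = find_vma(current->mm, addr);
		if (!vma || vma->vm_start > addr ||
		    (vma->vm_flags & (VM_IO | VM_PFNMAP)) !=
		    (VM_IO | VM_PFNMAP)) {
			ret = false;
			break;
		}
	}
	mmap_read_unlock(current->mm);

	return ret;
}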
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b07247b0b958..9f7223d298a2 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
+	select KVM_VIRTIO
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b1f92a0edc35..7f57fa902a0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1394,6 +1394,9 @@ struct kvm_device_attr {
 #define   KVM_DEV_VFIO_GROUP_DEL	KVM_DEV_VFIO_FILE_DEL
 #define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE	3
 
+#define KVM_DEV_VIRTIO_NONCOHERENT	1
+#define   KVM_DEV_VIRTIO_NONCOHERENT_SET	1
+
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
 #define KVM_DEV_TYPE_FSL_MPIC_20	KVM_DEV_TYPE_FSL_MPIC_20
@@ -1417,6 +1420,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
 	KVM_DEV_TYPE_RISCV_AIA,
 #define KVM_DEV_TYPE_RISCV_AIA		KVM_DEV_TYPE_RISCV_AIA
+	KVM_DEV_TYPE_VIRTIO,
+#define KVM_DEV_TYPE_VIRTIO		KVM_DEV_TYPE_VIRTIO
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 6793211a0b64..5dcba034593c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -56,6 +56,9 @@ config HAVE_KVM_CPU_RELAX_INTERCEPT
 config KVM_VFIO
        bool
 
+config KVM_VIRTIO
+       bool
+
 config HAVE_KVM_INVALID_WAKEUPS
        bool
 
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 724c89af78af..614c4c4fe50e 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -7,6 +7,7 @@ KVM ?= ../../../virt/kvm
 kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
 kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
+kvm-$(CONFIG_KVM_VIRTIO) += $(KVM)/virtio.o
 kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd67fb40183..dca2040368da 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,7 @@
 #include "async_pf.h"
 #include "kvm_mm.h"
 #include "vfio.h"
+#include "virtio.h"
 
 #include <trace/events/ipi.h>
 
@@ -6453,6 +6454,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;
 
+	r = kvm_virtio_ops_init();
+	if (WARN_ON_ONCE(r))
+		goto err_virtio;
+
 	kvm_gmem_init(module);
 
 	/*
@@ -6468,6 +6473,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	return 0;
 
 err_register:
+	kvm_virtio_ops_exit();
+err_virtio:
 	kvm_vfio_ops_exit();
 err_vfio:
 	kvm_async_pf_deinit();
@@ -6503,6 +6510,7 @@ void kvm_exit(void)
 		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
 	kmem_cache_destroy(kvm_vcpu_cache);
 	kvm_vfio_ops_exit();
+	kvm_virtio_ops_exit();
 	kvm_async_pf_deinit();
 #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
 	unregister_syscore_ops(&kvm_syscore_ops);
diff --git a/virt/kvm/virtio.c b/virt/kvm/virtio.c
new file mode 100644
index 000000000000..dbb25d9784c5
--- /dev/null
+++ b/virt/kvm/virtio.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM-VIRTIO device
+ *
+ */
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include "virtio.h"
+
+struct kvm_virtio {
+	struct mutex lock;
+	bool noncoherent;
+};
+
+static int kvm_virtio_set_noncoherent(struct kvm_device *dev, long attr,
+				      void __user *arg)
+{
+	struct kvm_virtio *kv = dev->private;
+
+	/*
+	 * Currently, only set to noncoherent is allowed, and therefore a virtio
+	 * device is not allowed to switch back to coherent once it's set to
+	 * noncoherent.
+	 * User arg is also not checked as the attr name has indicated that the
+	 * purpose is to set to noncoherent.
+	 */
+	if (attr != KVM_DEV_VIRTIO_NONCOHERENT_SET)
+		return -ENXIO;
+
+	mutex_lock(&kv->lock);
+	if (kv->noncoherent)
+		goto out;
+
+	kv->noncoherent = true;
+	kvm_arch_register_noncoherent_dma(dev->kvm);
+out:
+	mutex_unlock(&kv->lock);
+	return 0;
+}
+
+static int kvm_virtio_set_attr(struct kvm_device *dev,
+			       struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_VIRTIO_NONCOHERENT:
+		return kvm_virtio_set_noncoherent(dev, attr->attr,
+						  u64_to_user_ptr(attr->addr));
+	}
+
+	return -ENXIO;
+}
+
+static int kvm_virtio_has_attr(struct kvm_device *dev,
+			       struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_VIRTIO_NONCOHERENT:
+		switch (attr->attr) {
+		case KVM_DEV_VIRTIO_NONCOHERENT_SET:
+			return 0;
+		}
+
+		break;
+	}
+
+	return -ENXIO;
+}
+
+static void kvm_virtio_release(struct kvm_device *dev)
+{
+	struct kvm_virtio *kv = dev->private;
+
+	if (kv->noncoherent)
+		kvm_arch_unregister_noncoherent_dma(dev->kvm);
+	kfree(kv);
+	kfree(dev);	/* alloc by kvm_ioctl_create_device, free by .release */
+}
+
+static int kvm_virtio_create(struct kvm_device *dev, u32 type);
+
+static struct kvm_device_ops kvm_virtio_ops = {
+	.name = "kvm-virtio",
+	.create = kvm_virtio_create,
+	.release = kvm_virtio_release,
+	.set_attr = kvm_virtio_set_attr,
+	.has_attr = kvm_virtio_has_attr,
+};
+
+static int kvm_virtio_create(struct kvm_device *dev, u32 type)
+{
+	struct kvm_virtio *kv;
+
+	if (type != KVM_DEV_TYPE_VIRTIO)
+		return -ENODEV;
+
+	/*
+	 * This kvm_virtio device is created per virtio device.
+	 * Its default noncoherent state is false.
+	 */
+	kv = kzalloc(sizeof(*kv), GFP_KERNEL_ACCOUNT);
+	if (!kv)
+		return -ENOMEM;
+
+	mutex_init(&kv->lock);
+
+	dev->private = kv;
+
+	return 0;
+}
+
+int kvm_virtio_ops_init(void)
+{
+	return kvm_register_device_ops(&kvm_virtio_ops, KVM_DEV_TYPE_VIRTIO);
+}
+
+void kvm_virtio_ops_exit(void)
+{
+	kvm_unregister_device_ops(KVM_DEV_TYPE_VIRTIO);
+}
diff --git a/virt/kvm/virtio.h b/virt/kvm/virtio.h
new file mode 100644
index 000000000000..0353398ee1f2
--- /dev/null
+++ b/virt/kvm/virtio.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_VIRTIO_H
+#define __KVM_VIRTIO_H
+
+#ifdef CONFIG_KVM_VIRTIO
+int kvm_virtio_ops_init(void);
+void kvm_virtio_ops_exit(void);
+#else
+static inline int kvm_virtio_ops_init(void)
+{
+	return 0;
+}
+static inline void kvm_virtio_ops_exit(void)
+{
+}
+#endif
+
+#endif
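(For reference, a userspace usage sketch for the device this patch introduces;
error handling is trimmed, and the flow mirrors other KVM devices such as
kvm-vfio. The device type and attribute constants come from the patch above.)

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Create one kvm-virtio device per virtio device and mark it noncoherent;
 * per the patch, there is no way to switch the device back to coherent.
 */
static int mark_virtio_noncoherent(int vm_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VIRTIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VIRTIO_NONCOHERENT,
		.attr  = KVM_DEV_VIRTIO_NONCOHERENT_SET,
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	/* Bumps the VM's noncoherent DMA count via
	 * kvm_arch_register_noncoherent_dma(). */
	return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}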