Message ID | 0-v1-331b76591255+552-vfio_sme_jgg@nvidia.com (mailing list archive)
---|---
State | New, archived
Series | vfio-pci: Use io_remap_pfn_range() for PCI IO memory
On Thu, Nov 05, 2020 at 12:34:58PM -0400, Jason Gunthorpe wrote:
> Tom says VFIO device assignment works OK with KVM, so I expect only things
> like DPDK to be broken.

Is there more information on why the difference?  Thanks,
On Thu, Nov 05, 2020 at 06:39:49PM -0500, Peter Xu wrote:
> On Thu, Nov 05, 2020 at 12:34:58PM -0400, Jason Gunthorpe wrote:
> > Tom says VFIO device assignment works OK with KVM, so I expect only things
> > like DPDK to be broken.
>
> Is there more information on why the difference?  Thanks,

I have nothing, maybe Tom can explain how it works?

Jason
On 11/16/20 9:53 AM, Jason Gunthorpe wrote:
> On Thu, Nov 05, 2020 at 06:39:49PM -0500, Peter Xu wrote:
>> Is there more information on why the difference?  Thanks,
>
> I have nothing, maybe Tom can explain how it works?

IIUC, the main differences would be along the lines of what is performing
the mappings or who is performing the MMIO.

For device passthrough using VFIO, the guest kernel is the one that ends
up performing the MMIO in kernel space with the proper encryption mask
(unencrypted).

I'm not familiar with how DPDK really works other than it is userspace
based and uses polling drivers, etc. So it all depends on how everything
gets mapped and by whom. For example, using mmap() to get a mapping to
something that should be mapped unencrypted will be an issue, since the
userspace mappings are created encrypted.

Extending mmap() to be able to specify a new flag, maybe MAP_UNENCRYPTED,
might be something to consider.

Thanks,
Tom
On Mon, Nov 16, 2020 at 03:43:53PM -0600, Tom Lendacky wrote:
> IIUC, the main differences would be along the lines of what is performing
> the mappings or who is performing the MMIO.
>
> For device passthrough using VFIO, the guest kernel is the one that ends
> up performing the MMIO in kernel space with the proper encryption mask
> (unencrypted).

The question here is why does VF assignment work if the MMIO mapping
in the hypervisor is being marked encrypted.

It sounds like this means the page table in the hypervisor is ignored,
and it works because the VM's kernel marks the guest's page table as
non-encrypted?

> I'm not familiar with how DPDK really works other than it is userspace
> based and uses polling drivers, etc. So it all depends on how everything
> gets mapped and by whom. For example, using mmap() to get a mapping to
> something that should be mapped unencrypted will be an issue, since the
> userspace mappings are created encrypted.

It is the same as the RDMA stuff: DPDK calls mmap() against VFIO, which
calls remap_pfn_range() and creates encrypted mappings.

> Extending mmap() to be able to specify a new flag, maybe
> MAP_UNENCRYPTED, might be something to consider.

Not sure how this makes sense here, the kernel knows this should not be
encrypted.

Jason
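To make the userspace path being discussed concrete, here is a minimal
sketch of how a DPDK-style process ends up with such a mapping. It assumes
a VFIO device fd has already been obtained through the usual
container/group ioctls, and map_bar0() is a hypothetical helper, not code
from the thread; the point is only that the BAR mapping arrives via a
plain mmap() on the device fd, so whether the resulting PTEs carry the SME
encryption bit is decided entirely by the kernel's fault/remap path.

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: map BAR0 of an already-opened VFIO device fd the
 * way a userspace driver such as DPDK would.  The page protection of the
 * resulting mapping is chosen by the kernel when the vma is populated
 * (remap_pfn_range() vs io_remap_pfn_range() in vfio_pci_mmap_fault()).
 */
static void *map_bar0(int device_fd)
{
	struct vfio_region_info reg = {
		.argsz = sizeof(reg),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
		return MAP_FAILED;
	if (!(reg.flags & VFIO_REGION_INFO_FLAG_MMAP))
		return MAP_FAILED;

	/* The offset returned by the kernel encodes the region index. */
	return mmap(NULL, reg.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device_fd, (off_t)reg.offset);
}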
On 11/16/20 5:20 PM, Jason Gunthorpe wrote:
> The question here is why does VF assignment work if the MMIO mapping
> in the hypervisor is being marked encrypted.
>
> It sounds like this means the page table in the hypervisor is ignored,
> and it works because the VM's kernel marks the guest's page table as
> non-encrypted?

If I understand the VFIO code correctly, the MMIO area gets registered as
a RAM memory region and added to the guest. This MMIO region is accessed
in the guest through ioremap(), which creates an un-encrypted mapping,
allowing the guest to read it properly. So I believe the mmap() call only
provides the information used to register the memory region for guest
access and is not directly accessed by Qemu (I don't believe the guest
VMEXITs for the MMIO access, but I could be wrong).

> It is the same as the RDMA stuff: DPDK calls mmap() against VFIO, which
> calls remap_pfn_range() and creates encrypted mappings.
>
>> Extending mmap() to be able to specify a new flag, maybe
>> MAP_UNENCRYPTED, might be something to consider.
>
> Not sure how this makes sense here, the kernel knows this should not be
> encrypted.

Yeah, not in this case. Was just a general comment on whether to allow
userspace to do something like that on any mmap().

Thanks,
Tom
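As an illustration of the guest-side path Tom describes (a sketch, not
code from any driver discussed here): the guest driver for the assigned
device maps the BAR with ioremap(), and on an SEV guest that kernel
mapping is created with the encryption bit cleared, which is why the
guest's MMIO works regardless of how the host userspace mapping was made.

#include <linux/io.h>
#include <linux/pci.h>

/*
 * Sketch of the guest-side access: the guest driver ioremap()s the
 * assigned device's BAR.  On an SEV guest, ioremap() of MMIO produces an
 * unencrypted kernel mapping, so reads and writes reach the device
 * instead of being transformed by the memory encryption engine.
 * Hypothetical helper; error handling omitted.
 */
static void __iomem *guest_map_bar0(struct pci_dev *pdev)
{
	return ioremap(pci_resource_start(pdev, 0),
		       pci_resource_len(pdev, 0));
}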
On Tue, 17 Nov 2020 09:33:17 -0600
Tom Lendacky <thomas.lendacky@amd.com> wrote:

> If I understand the VFIO code correctly, the MMIO area gets registered as
> a RAM memory region and added to the guest. This MMIO region is accessed
> in the guest through ioremap(), which creates an un-encrypted mapping,
> allowing the guest to read it properly. So I believe the mmap() call only
> provides the information used to register the memory region for guest
> access and is not directly accessed by Qemu (I don't believe the guest
> VMEXITs for the MMIO access, but I could be wrong).

Ideally it won't, but trapping through QEMU is a common debugging
technique and required if we implement virtualization quirks for a
device in QEMU. So I believe what you're saying is that device
assignment on SEV probably works only when we're using direct mapping
of the mmap into the VM, and tracing or quirks would currently see
encrypted data. Has anyone had the opportunity to check that we don't
break device assignment to VMs with this patch?  Thanks,

Alex
On Tue, Nov 17, 2020 at 09:33:17AM -0600, Tom Lendacky wrote:
> If I understand the VFIO code correctly, the MMIO area gets registered as
> a RAM memory region and added to the guest. This MMIO region is accessed
> in the guest through ioremap(), which creates an un-encrypted mapping,
> allowing the guest to read it properly. So I believe the mmap() call only
> provides the information used to register the memory region for guest
> access and is not directly accessed by Qemu (I don't believe the guest
> VMEXITs for the MMIO access, but I could be wrong).

Thanks for the explanations.

It seems fine if a two-dimensional page table is used in KVM, as long as
the 1st-level guest page table is handled the same way as in the host.

I'm wondering what happens if a shadow page table is used - IIUC here the
VFIO MMIO region will be the same as normal guest RAM from the KVM memslot
point of view, but if the MMIO region is not encrypted, does it also mean
that the whole guest RAM is not encrypted too? It's a pure question,
because I feel like these are two layers of security (host as the 1st,
guest as the 2nd); maybe here we're only talking about host security
rather than the guest's, in which case it looks fine too.
On 11/17/20 9:57 AM, Peter Xu wrote:
> It seems fine if a two-dimensional page table is used in KVM, as long as
> the 1st-level guest page table is handled the same way as in the host.
>
> I'm wondering what happens if a shadow page table is used - IIUC here the
> VFIO MMIO region will be the same as normal guest RAM from the KVM memslot
> point of view, but if the MMIO region is not encrypted, does it also mean
> that the whole guest RAM is not encrypted too? It's a pure question,
> because I feel like these are two layers of security (host as the 1st,
> guest as the 2nd); maybe here we're only talking about host security
> rather than the guest's, in which case it looks fine too.

SEV is only supported with NPT (TDP).

Thanks,
Tom
On 11/17/20 9:54 AM, Alex Williamson wrote:
> Ideally it won't, but trapping through QEMU is a common debugging
> technique and required if we implement virtualization quirks for a
> device in QEMU. So I believe what you're saying is that device
> assignment on SEV probably works only when we're using direct mapping
> of the mmap into the VM, and tracing or quirks would currently see
> encrypted data. Has anyone had the opportunity to check that we don't
> break device assignment to VMs with this patch?  Thanks,

I have not been able to test device assignment with this patch, yet.
Jason?

Thanks,
Tom
On Tue, Nov 17, 2020 at 10:37:18AM -0600, Tom Lendacky wrote:
> > Ideally it won't, but trapping through QEMU is a common debugging
> > technique and required if we implement virtualization quirks for a
> > device in QEMU. So I believe what you're saying is that device
> > assignment on SEV probably works only when we're using direct mapping
> > of the mmap into the VM, and tracing or quirks would currently see
> > encrypted data. Has anyone had the opportunity to check that we don't
> > break device assignment to VMs with this patch?  Thanks,
>
> I have not been able to test device assignment with this patch, yet.
> Jason?

I don't have SME systems; we have a customer that reported RDMA didn't
work and confirmed the similar RDMA patch worked.

I know VFIO is basically identical, so it should be applicable here too.

Jason
On 11/17/20 11:07 AM, Jason Gunthorpe wrote:
> I don't have SME systems; we have a customer that reported RDMA didn't
> work and confirmed the similar RDMA patch worked.

Right, I think Alex was asking about device assignment to a guest in
general, regardless of SME.

Thanks,
Tom

> I know VFIO is basically identical, so it should be applicable here too.
On Tue, Nov 17, 2020 at 10:34:37AM -0600, Tom Lendacky wrote:
> SEV is only supported with NPT (TDP).

I see, thanks for answering (even if my question was kind of
off-topic..).

Regarding this patch, my current understanding is that the VM case worked
only because the guests in the previous tests were always using KVM's
directly mapped MMIO accesses. However, that should not always be
guaranteed, because QEMU is in complete control of that (e.g., QEMU can
switch to user exits for all MMIO accesses of a vfio-pci device at any
time without the guest's awareness).

Logically this patch should fix that, just like the DPDK scenario where
MMIO regions were accessed from userspace (QEMU). From that point of
view, I think this patch should help.

Acked-by: Peter Xu <peterx@redhat.com>

Though if my above understanding is correct, it would be nice to mention
some of the above information in the commit message too, though it may
not be worth a repost. Tests will always be welcomed as suggested by
Alex, of course.

Thanks,
On Tue, Nov 17, 2020 at 01:17:54PM -0500, Peter Xu wrote:
> Logically this patch should fix that, just like the DPDK scenario where
> MMIO regions were accessed from userspace (QEMU). From that point of
> view, I think this patch should help.
>
> Acked-by: Peter Xu <peterx@redhat.com>

Thanks Peter

Is there more to do here?

Jason
On 11/26/20 2:13 PM, Jason Gunthorpe wrote:
> On Tue, Nov 17, 2020 at 01:17:54PM -0500, Peter Xu wrote:
>
>> Logically this patch should fix that, just like the DPDK scenario where
>> MMIO regions were accessed from userspace (QEMU). From that point of
>> view, I think this patch should help.
>>
>> Acked-by: Peter Xu <peterx@redhat.com>
>
> Thanks Peter
>
> Is there more to do here?

I just did a quick, limited passthrough test of a NIC device (non SRIOV)
for a legacy and an SEV guest and it all appears to work.

I don't have anything more (i.e. SRIOV, GPUs, etc.) with which to test
device passthrough.

Thanks,
Tom
On Mon, 30 Nov 2020 08:34:51 -0600
Tom Lendacky <thomas.lendacky@amd.com> wrote:

> I just did a quick, limited passthrough test of a NIC device (non SRIOV)
> for a legacy and an SEV guest and it all appears to work.
>
> I don't have anything more (i.e. SRIOV, GPUs, etc.) with which to test
> device passthrough.

Thanks, I'll include this for v5.11.

Alex
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fbd2b3404184ba..1853cc2548c966 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1635,8 +1635,8 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 
 	mutex_unlock(&vdev->vma_lock);
 
-	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
+	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
 		ret = VM_FAULT_SIGBUS;
 
 up_out:
commit f8f6ae5d077a ("mm: always have io_remap_pfn_range() set
pgprot_decrypted()") allows drivers using mmap to put PCI memory mapped
BAR space into userspace to work correctly on AMD SME systems that
default to all memory encrypted.

Since vfio_pci_mmap_fault() is working with PCI memory mapped BAR space
it should be calling io_remap_pfn_range(), otherwise it will not work on
SME systems.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

The io_remap_pfn_range() commit is in Linus's tree and will be in rc3,
but there is no cross dependency here.

Tom says VFIO device assignment works OK with KVM, so I expect only
things like DPDK to be broken. Don't have SME hardware, can't test.
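For context on why switching the call is sufficient: after commit
f8f6ae5d077a, the generic fallback of io_remap_pfn_range() differs from
remap_pfn_range() only in applying pgprot_decrypted() to the page
protection. The shape of that fallback is roughly as below; treat it as a
paraphrase rather than the exact kernel source, since architectures may
provide their own definition.

/*
 * Rough shape of the generic io_remap_pfn_range() fallback after commit
 * f8f6ae5d077a: identical to remap_pfn_range() except that the page
 * protection has the memory-encryption bit cleared, so MMIO mapped into
 * userspace is not run through SME.
 */
static inline int io_remap_pfn_range(struct vm_area_struct *vma,
				     unsigned long addr, unsigned long pfn,
				     unsigned long size, pgprot_t prot)
{
	return remap_pfn_range(vma, addr, pfn, size,
			       pgprot_decrypted(prot));
}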