Message ID | 158947414729.12590.4345248265094886807.stgit@gimli.home
---|---
Series | vfio/type1/pci: IOMMU PFNMAP invalidation
On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> This is a follow-on series to "vfio-pci: Block user access to disabled
> device MMIO"[1], which extends user access blocking of disabled MMIO
> ranges to include unmapping the ranges from the IOMMU. The first patch
> adds an invalidation callback path, allowing vfio bus drivers to signal
> the IOMMU backend to unmap ranges with vma level granularity. This
> signaling is done both when the MMIO range becomes inaccessible due to
> memory disabling, as well as when a vma is closed, making up for the
> lack of tracking or pinning for non-page backed vmas. The second
> patch adds registration and testing interfaces such that the IOMMU
> backend driver can test whether a given PFNMAP vma is provided by a
> vfio bus driver supporting invalidation. We can then implement more
> restricted semantics to only allow PFNMAP DMA mappings when we have
> such support, which becomes the new default.

Hi, Alex,

IIUC we'll be directly tearing down the IOMMU page table without telling
userspace for those PFNMAP pages. I'm wondering whether there would be
any side effect on the userspace side when userspace has cached this
mapping information somehow. E.g., is there a way for userspace to know
about this event?

Currently, QEMU VT-d maintains all the IOVA mappings for each assigned
device when used with vfio-pci. In this case, QEMU will probably need to
depend on invalidations sent from the guest (either userspace or kernel)
device drivers to invalidate such IOVA mappings after they are removed
from the hardware IOMMU page table underneath. I haven't thought more
deeply about what would happen if the vIOMMU ends up with a mapping that
is inconsistent with the real device.

Thanks,
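The callback path the cover letter describes can be pictured with a small toy model. All names below (`toy_vma`, `toy_iommu_invalidate`, `toy_pci_disable_mem`) are hypothetical illustrations, not the series' actual kernel interfaces: the bus driver signals the IOMMU backend when the memory enable bit is cleared or a vma closes, and the backend unmaps at vma granularity.

```c
/* Toy model of the invalidation callback path from the cover letter.
 * All names are hypothetical illustrations, not the real kernel API. */
#include <assert.h>
#include <stdbool.h>

struct toy_vma {
    unsigned long start, end;   /* user virtual range of the BAR mmap */
    bool iommu_mapped;          /* does the IOMMU backend map it? */
};

/* IOMMU backend side: unmap with vma-level granularity. */
static void toy_iommu_invalidate(struct toy_vma *vma)
{
    vma->iommu_mapped = false;
}

/* Bus driver side: memory enable bit cleared, or vma closed ->
 * signal the IOMMU backend to drop the mapping. */
static void toy_pci_disable_mem(struct toy_vma *vma)
{
    toy_iommu_invalidate(vma);
}
```

Note that in this sketch, as in Peter's concern above, nothing notifies userspace: the mapping simply disappears underneath any cached view of it.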
On Thu, 14 May 2020 17:25:38 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> > This is a follow-on series to "vfio-pci: Block user access to disabled
> > device MMIO"[1], which extends user access blocking of disabled MMIO
> > ranges to include unmapping the ranges from the IOMMU. [...]
>
> Hi, Alex,
>
> IIUC we'll be directly tearing down the IOMMU page table without telling
> userspace for those PFNMAP pages. I'm wondering whether there would be
> any side effect on the userspace side when userspace has cached this
> mapping information somehow. E.g., is there a way for userspace to know
> about this event?
>
> Currently, QEMU VT-d maintains all the IOVA mappings for each assigned
> device when used with vfio-pci. In this case, QEMU will probably need to
> depend on invalidations sent from the guest (either userspace or kernel)
> device drivers to invalidate such IOVA mappings after they are removed
> from the hardware IOMMU page table underneath. I haven't thought more
> deeply about what would happen if the vIOMMU ends up with a mapping that
> is inconsistent with the real device.

Full disclosure, I haven't tested vIOMMU, there might be issues. Let's
puzzle through this.

Without a vIOMMU, the vfio MemoryListener in QEMU makes use of
address_space_memory, which is essentially the vCPU view of memory. When
the memory bit of a PCI device is disabled, QEMU correctly removes the
MMIO regions of the device from this AddressSpace. When re-enabled, they
get re-added. In that case what we're doing here is a little bit
redundant: the IOMMU mappings get dropped in response to the memory bit,
and the subsequent callback from the MemoryListener is effectively a
no-op since the range is already unmapped. When the memory bit is
re-enabled, the AddressSpace gets updated, the MemoryListener fires, and
we re-fault the mmap as we're re-adding the IOMMU mapping.

When we have a vIOMMU, the address_space_memory behavior should be the
same as above; the vIOMMU isn't involved in vCPU-to-device access. So I
think our concern is explicit mappings, much like vfio itself makes.
That feels like a gap. When the vma is being closed, I think dropping
those mappings out from under the user is probably still the right
approach, and I think this series would still be useful if we only did
that much. I think this would also address Jason's primary concern.
It's better to get an IOMMU fault from the user trying to access those
mappings than it is to leave them in place.

OTOH, if we're dropping mappings in response to disabling the memory
bit, we're changing a potential disabled-MMIO access fault into an IOMMU
fault, where the platform response might very well be fatal in either
case. Maybe we need to look at a more temporary invalidation due to the
memory enable bit. If we could remap the range to a kernel page we could
maybe avoid the IOMMU fault, and maybe even have a crude test for
whether any data was written to the page while that mapping was in place
(ie. simulating more restricted error handling, though more asynchronous
than done at the platform level). Let me look into it. Thanks,

Alex
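The AddressSpace/MemoryListener flow Alex walks through can be modeled in a few lines. This is a simplified sketch, not QEMU's real MemoryListener API: toggling the memory enable bit removes or re-adds the BAR region from the address space, and the listener callback mirrors that change into the IOMMU mapping, which is why the kernel-side unmap is largely redundant in the no-vIOMMU case.

```c
/* Simplified model of the QEMU-side flow: toggling the PCI memory
 * enable bit removes/re-adds the BAR from the AddressSpace, and a
 * MemoryListener-like callback keeps the IOMMU mapping in sync.
 * Illustrative only; QEMU's actual MemoryListener API differs. */
#include <assert.h>
#include <stdbool.h>

struct toy_region { bool present; };   /* BAR in address_space_memory? */
struct toy_iommu  { bool mapped; };    /* corresponding IOMMU mapping  */

struct toy_listener {
    void (*region_add)(struct toy_iommu *);
    void (*region_del)(struct toy_iommu *);
};

static void map_cb(struct toy_iommu *i)   { i->mapped = true; }
static void unmap_cb(struct toy_iommu *i) { i->mapped = false; }

/* Memory enable bit flips update the region and fire the listener,
 * so the IOMMU view tracks the vCPU view of memory. */
static void toy_set_mem_enable(bool on, struct toy_region *r,
                               struct toy_iommu *i,
                               const struct toy_listener *l)
{
    if (on && !r->present)  { r->present = true;  l->region_add(i); }
    if (!on && r->present)  { r->present = false; l->region_del(i); }
}
```

The gap discussed in the thread is precisely what this model does not cover: mappings made explicitly through a vIOMMU rather than through the vCPU view of memory.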
On Thu, May 14, 2020 at 04:17:12PM -0600, Alex Williamson wrote:
> that much. I think this would also address Jason's primary concern.
> It's better to get an IOMMU fault from the user trying to access those
> mappings than it is to leave them in place.

Yes, there are a few options here - if the pages are available for use
by the IOMMU and *asynchronously* someone else revokes them, then the
only way to protect the kernel is to block them from the IOMMU.

For this to be sane the revocation must be under complete control of
the VFIO user. ie if a user decides to disable MMIO traffic then of
course the IOMMU should block P2P transfers to the MMIO bar. It is user
error not to have disabled those transfers in the first place.

When this is all done inside a guest the whole logic applies. On bare
metal you might get some AER or crash or MCE. In virtualization you'll
get an IOMMU fault.

> due to the memory enable bit. If we could remap the range to a kernel
> page we could maybe avoid the IOMMU fault and maybe even have a crude
> test for whether any data was written to the page while that mapping
> was in place (ie. simulating more restricted error handling, though
> more asynchronous than done at the platform level).

I'm not sure if this makes sense; can't we arrange to directly trap the
IOMMU failure and route it into qemu if that is what is desired?

I'll try to look at this next week, swamped right now.

Thanks,
Jason
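Alex's quoted "remap the range to a kernel page" idea could be modeled roughly as follows. This is a userspace-level sketch with made-up names; the real mechanism would live in the kernel's remap/fault paths. While the memory enable bit is clear, the translation is diverted to a zeroed scratch page, and on restore we can crudely check whether any stray DMA wrote to it.

```c
/* Sketch of temporarily diverting a mapping to a scratch page instead
 * of leaving a hole that faults.  All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE 4096

static unsigned char bar_page[TOY_PAGE];  /* stands in for the MMIO page */
static unsigned char scratch[TOY_PAGE];   /* kernel scratch page */

struct toy_mapping { unsigned char *target; };

/* Divert the mapping while the memory enable bit is clear. */
static void toy_divert(struct toy_mapping *m)
{
    memset(scratch, 0, TOY_PAGE);
    m->target = scratch;
}

/* Restore the real page; report whether DMA hit the scratch page
 * while the mapping was diverted. */
static bool toy_restore(struct toy_mapping *m)
{
    bool dirtied = false;
    for (int i = 0; i < TOY_PAGE; i++)
        if (scratch[i]) { dirtied = true; break; }
    m->target = bar_page;
    return dirtied;
}
```

The trade-off the thread debates is visible here: diverting avoids the IOMMU fault (and any fatal firmware-first handling of it), at the cost of silently absorbing writes rather than reporting them synchronously.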
On Thu, 14 May 2020 19:24:15 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, May 14, 2020 at 04:17:12PM -0600, Alex Williamson wrote:
> [...]
> I'm not sure if this makes sense; can't we arrange to directly trap the
> IOMMU failure and route it into qemu if that is what is desired?

Can't guarantee it, some systems wire that directly into their
management processor so that they can "protect their users" regardless
of whether they want or need it. Yay firmware-first error handling,
*sigh*. Thanks,

Alex
On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> > I'm not sure if this makes sense; can't we arrange to directly trap
> > the IOMMU failure and route it into qemu if that is what is desired?
>
> Can't guarantee it, some systems wire that directly into their
> management processor so that they can "protect their users" regardless
> of whether they want or need it. Yay firmware-first error handling,
> *sigh*. Thanks,

Sorry to be slightly off topic - Alex, does this mean the general
approach of fault reporting from vfio to userspace is not going to work
either?

Thanks,
On Fri, 15 May 2020 11:22:51 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> > > I'm not sure if this makes sense; can't we arrange to directly trap
> > > the IOMMU failure and route it into qemu if that is what is desired?
> >
> > Can't guarantee it, some systems wire that directly into their
> > management processor so that they can "protect their users" regardless
> > of whether they want or need it. Yay firmware-first error handling,
> > *sigh*. Thanks,
>
> Sorry to be slightly off topic - Alex, does this mean the general
> approach of fault reporting from vfio to userspace is not going to work
> either?

AFAIK these platforms only generate a fatal fault on certain classes of
access which imply a potential for data loss, for example a DMA write to
an invalid PTE. The actual IOMMU page faulting mechanism should not be
affected by this, or at least one would hope. Thanks,

Alex
On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> This is a follow-on series to "vfio-pci: Block user access to disabled
> device MMIO"[1], which extends user access blocking of disabled MMIO
> ranges to include unmapping the ranges from the IOMMU. The first patch
> adds an invalidation callback path, allowing vfio bus drivers to signal
> the IOMMU backend to unmap ranges with vma level granularity. This
> signaling is done both when the MMIO range becomes inaccessible due to
> memory disabling, as well as when a vma is closed, making up for the
> lack of tracking or pinning for non-page backed vmas. The second
> patch adds registration and testing interfaces such that the IOMMU
> backend driver can test whether a given PFNMAP vma is provided by a
> vfio bus driver supporting invalidation. We can then implement more
> restricted semantics to only allow PFNMAP DMA mappings when we have
> such support, which becomes the new default.
>
> Jason, if you'd like Suggested-by credit for the ideas here I'd be
> glad to add it. Thanks,

Certainly a Reported-by would be OK.

The only thing I don't like here is that this makes some P2P DMA
mapping scheme for VMAs with invalidation that is completely private to
vfio. Many of us want this in other subsystems, and there are
legitimate uses for vfio to import BAR memory for P2P from places other
than itself. So I would really rather this be supported by the core
kernel in some way.

That said, this is a bug fix, and we still don't have much agreement on
what the core kernel version should look like, let alone how it should
work with an IOMMU. So maybe this goes ahead as is and we can figure
out how to replace it with something general later on?

Jason
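The "registration and testing interfaces" of the second patch can be sketched as a registry the IOMMU backend consults before accepting a PFNMAP mapping. The names and the fixed-size array are purely illustrative, not the series' real data structures:

```c
/* Toy sketch of patch 2's idea: the IOMMU backend only accepts a
 * PFNMAP DMA mapping if the vma was registered by a vfio bus driver
 * that supports invalidation.  Names and layout are illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_VMAS 16

struct toy_registry {
    unsigned long start[TOY_MAX_VMAS], end[TOY_MAX_VMAS];
    size_t n;
};

/* Bus driver registers a vma it knows how to invalidate. */
static void toy_register_vma(struct toy_registry *reg,
                             unsigned long start, unsigned long end)
{
    if (reg->n < TOY_MAX_VMAS) {
        reg->start[reg->n] = start;
        reg->end[reg->n] = end;
        reg->n++;
    }
}

/* The new, more restricted default: PFNMAP DMA mapping is allowed
 * only for vmas backed by a driver supporting invalidation. */
static bool toy_pfnmap_dma_allowed(const struct toy_registry *reg,
                                   unsigned long addr)
{
    for (size_t i = 0; i < reg->n; i++)
        if (addr >= reg->start[i] && addr < reg->end[i])
            return true;
    return false;
}
```

Jason's objection maps onto this picture directly: as sketched, the registry is private to one subsystem, while other subsystems importing BAR memory for P2P would want the same check from the core kernel.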
On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> [...]
> Can't guarantee it, some systems wire that directly into their
> management processor so that they can "protect their users" regardless
> of whether they want or need it. Yay firmware-first error handling,
> *sigh*.

I feel like those systems should just lose the ability to reliably
mirror IOMMU errors to their guests - trying to emulate it by scanning
memory/etc sounds too horrible.

Jason