[0/2] vfio/type1/pci: IOMMU PFNMAP invalidation

Message ID: 158947414729.12590.4345248265094886807.stgit@gimli.home

Message

Alex Williamson May 14, 2020, 4:51 p.m. UTC
This is a follow-on series to "vfio-pci: Block user access to disabled
device MMIO"[1], which extends user access blocking of disabled MMIO
ranges to include unmapping the ranges from the IOMMU.  The first patch
adds an invalidation callback path, allowing vfio bus drivers to signal
the IOMMU backend to unmap ranges with vma level granularity.  This
signaling is done both when the MMIO range becomes inaccessible due to
memory disabling, as well as when a vma is closed, making up for the
lack of tracking or pinning for non-page backed vmas.  The second
patch adds registration and testing interfaces such that the IOMMU
backend driver can test whether a given PFNMAP vma is provided by a
vfio bus driver supporting invalidation.  We can then implement more
restricted semantics to only allow PFNMAP DMA mappings when we have
such support, which becomes the new default.
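
As a rough sketch of the shape of such an interface (the identifiers below
are invented for illustration; they are not the names used in the patches):

    /*
     * Hypothetical sketch only, not the patch itself: a vfio bus driver
     * signals the IOMMU backend that a PFNMAP vma it provided is no
     * longer valid, and the backend can ask whether a given vma comes
     * from a driver that will send such signals.
     */
    #include <linux/mm_types.h>
    #include <linux/vfio.h>

    /* bus driver (e.g. vfio-pci) side: called when the MMIO range is
     * disabled or when the vma is closed */
    void vfio_invalidate_pfnmap_vma(struct vfio_device *device,
                                    struct vm_area_struct *vma);

    /* IOMMU backend (type1) side: under the stricter default, refuse to
     * map a PFNMAP vma unless it comes from a cooperating vfio driver */
    bool vfio_pfnmap_vma_is_invalidate_capable(struct vm_area_struct *vma);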

Jason, if you'd like Suggested-by credit for the ideas here I'd be
glad to add it.  Thanks,

Alex

[1]https://lore.kernel.org/kvm/158871401328.15589.17598154478222071285.stgit@gimli.home/

---

Alex Williamson (2):
      vfio: Introduce bus driver to IOMMU invalidation interface
      vfio: Introduce strict PFNMAP mappings


 drivers/vfio/pci/vfio_pci.c         |   41 ++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |    1 
 drivers/vfio/vfio.c                 |   76 ++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c     |  130 +++++++++++++++++++++++++++--------
 include/linux/vfio.h                |    9 ++
 5 files changed, 222 insertions(+), 35 deletions(-)

Comments

Peter Xu May 14, 2020, 9:25 p.m. UTC | #1
On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> This is a follow-on series to "vfio-pci: Block user access to disabled
> device MMIO"[1], which extends user access blocking of disabled MMIO
> ranges to include unmapping the ranges from the IOMMU.  The first patch
> adds an invalidation callback path, allowing vfio bus drivers to signal
> the IOMMU backend to unmap ranges with vma level granularity.  This
> signaling is done both when the MMIO range becomes inaccessible due to
> memory disabling, as well as when a vma is closed, making up for the
> lack of tracking or pinning for non-page backed vmas.  The second
> patch adds registration and testing interfaces such that the IOMMU
> backend driver can test whether a given PFNMAP vma is provided by a
> vfio bus driver supporting invalidation.  We can then implement more
> restricted semantics to only allow PFNMAP DMA mappings when we have
> such support, which becomes the new default.

Hi, Alex,

IIUC we'll directly tear down the IOMMU page table for those PFNMAP pages
without telling userspace.  I'm wondering whether there could be any side
effect on the userspace side when userspace has cached this mapping
information somehow.  E.g., is there a way for userspace to know about
this event?

Currently, QEMU VT-d maintains all the IOVA mappings for each assigned
device when used with vfio-pci.  In this case, QEMU will probably need to
depend on invalidations sent from the guest (either userspace or kernel)
device drivers to invalidate such IOVA mappings after they are removed from
the hardware IOMMU page table underneath.  I haven't thought deeply about
what would happen if the vIOMMU ends up with an inconsistent mapping of the
real device.

Thanks,
Alex Williamson May 14, 2020, 10:17 p.m. UTC | #2
On Thu, 14 May 2020 17:25:38 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> > This is a follow-on series to "vfio-pci: Block user access to disabled
> > device MMIO"[1], which extends user access blocking of disabled MMIO
> > ranges to include unmapping the ranges from the IOMMU.  The first patch
> > adds an invalidation callback path, allowing vfio bus drivers to signal
> > the IOMMU backend to unmap ranges with vma level granularity.  This
> > signaling is done both when the MMIO range becomes inaccessible due to
> > memory disabling, as well as when a vma is closed, making up for the
> > lack of tracking or pinning for non-page backed vmas.  The second
> > patch adds registration and testing interfaces such that the IOMMU
> > backend driver can test whether a given PFNMAP vma is provided by a
> > vfio bus driver supporting invalidation.  We can then implement more
> > restricted semantics to only allow PFNMAP DMA mappings when we have
> > such support, which becomes the new default.  
> 
> Hi, Alex,
> 
> IIUC we'll directly tear down the IOMMU page table for those PFNMAP pages
> without telling userspace.  I'm wondering whether there could be any side
> effect on the userspace side when userspace has cached this mapping
> information somehow.  E.g., is there a way for userspace to know about
> this event?
> 
> Currently, QEMU VT-d maintains all the IOVA mappings for each assigned
> device when used with vfio-pci.  In this case, QEMU will probably need to
> depend on invalidations sent from the guest (either userspace or kernel)
> device drivers to invalidate such IOVA mappings after they are removed from
> the hardware IOMMU page table underneath.  I haven't thought deeply about
> what would happen if the vIOMMU ends up with an inconsistent mapping of the
> real device.

Full disclosure, I haven't tested the vIOMMU case, so there might be
issues.  Let's puzzle through this.  Without a vIOMMU, the vfio
MemoryListener in QEMU
makes use of address_space_memory, which is essentially the vCPU view
of memory.  When the memory bit of a PCI device is disabled, QEMU
correctly removes the MMIO regions of the device from this AddressSpace.
When re-enabled, they get re-added.  In that case what we're doing here
is a little bit redundant, the IOMMU mappings get dropped in response
to the memory bit and the subsequent callback from the MemoryListener
is effectively a no-op since the range is already unmapped.  When the
memory bit is re-enabled, the AddressSpace gets updated, the
MemoryListener fires and we re-fault the mmap as we're re-adding the
IOMMU mapping.
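
For reference, the operations the MemoryListener ends up issuing are the
type1 DMA map/unmap ioctls.  A simplified userspace-side sketch (helper
names invented for illustration, container setup and error handling
omitted):

    #include <linux/vfio.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    /* When an MMIO section is added to the AddressSpace, a QEMU-like
     * user maps the mmap'd BAR into the IOMMU at its guest IOVA. */
    static int dma_map_region(int container_fd, uint64_t iova,
                              uint64_t vaddr, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

    /* When the section is removed (e.g. the memory bit is cleared), the
     * user unmaps it; with this series the kernel may already have
     * dropped the IOMMU mapping, so this unmap is effectively a no-op
     * for the disabled BAR. */
    static int dma_unmap_region(int container_fd, uint64_t iova,
                                uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }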

When we have a vIOMMU, the address_space_memory behavior should be the
same as above; the vIOMMU isn't involved in vCPU to device access.  So
I think our concern is explicit mappings, much like those vfio itself makes.
That feels like a gap.  When the vma is being closed, I think dropping
those mappings out from under the user is probably still the right
approach and I think this series would still be useful if we only did
that much.  I think this would also address Jason's primary concern.
It's better to get an IOMMU fault from the user trying to access those
mappings than it is to leave them in place.

OTOH, if we're dropping mappings in response to disabling the memory
bit, we're changing a potential disabled MMIO access fault into an
IOMMU fault, where the platform response might very well be fatal in
either case.  Maybe we need to look at a more temporary invalidation
due to the memory enable bit.  If we could remap the range to a kernel
page we could maybe avoid the IOMMU fault and maybe even have a crude
test for whether any data was written to the page while that mapping
was in place (ie. simulating more restricted error handling, though
more asynchronous than done at the platform level).  Let me look into
it.  Thanks,

Alex
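
One possible shape of the "remap the range to a kernel page" idea above,
purely illustrative and built only on the generic iommu_map()/iommu_unmap()
API (the helper name and the zeroed scratch page are assumptions, not
anything in this series):

    #include <linux/iommu.h>
    #include <linux/io.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /*
     * Swap a disabled BAR's IOVA range over to a zeroed scratch page so
     * that P2P DMA lands somewhere harmless instead of raising an IOMMU
     * fault, keeping the page as a crude record of whether anything
     * wrote to the range while the memory bit was clear.
     */
    static int quiesce_mmio_range(struct iommu_domain *domain,
                                  unsigned long iova, size_t size,
                                  struct page *scratch)
    {
        size_t off;
        int ret;

        iommu_unmap(domain, iova, size);    /* drop the MMIO mapping */

        /* map the same scratch page repeatedly over the whole range */
        for (off = 0; off < size; off += PAGE_SIZE) {
            ret = iommu_map(domain, iova + off, page_to_phys(scratch),
                            PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
            if (ret)
                return ret;
        }
        return 0;
    }

    /* On re-enable, memchr_inv(page_address(scratch), 0, PAGE_SIZE)
     * returning non-NULL would indicate the device wrote data while the
     * mapping was diverted. */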
Jason Gunthorpe May 14, 2020, 10:24 p.m. UTC | #3
On Thu, May 14, 2020 at 04:17:12PM -0600, Alex Williamson wrote:

> that much.  I think this would also address Jason's primary concern.
> It's better to get an IOMMU fault from the user trying to access those
> mappings than it is to leave them in place.

Yes, there are a few options here - if the pages are available for use
by the IOMMU and *asynchronously* someone else revokes them, then the
only way to protect the kernel is to block them from the IOMMU.

For this to be sane the revocation must be under complete control of
the VFIO user. ie if a user decides to disable MMIO traffic then of
course the IOMMU should block P2P transfer to the MMIO bar. It is user
error to have not disabled those transfers in the first place.

When this is all done inside a guest the whole logic applies. On bare
metal you might get some AER or crash or MCE. In virtualization you'll
get an IOMMU fault.

> due to the memory enable bit.  If we could remap the range to a kernel
> page we could maybe avoid the IOMMU fault and maybe even have a crude
> test for whether any data was written to the page while that mapping
> was in place (ie. simulating more restricted error handling, though
> more asynchronous than done at the platform level).  

I'm not sure if this makes sense; can't we arrange to directly trap the
IOMMU failure and route it into qemu if that is what is desired?

I'll try to look at this next week, swamped right now

Thanks,
Jason
Alex Williamson May 14, 2020, 10:55 p.m. UTC | #4
On Thu, 14 May 2020 19:24:15 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, May 14, 2020 at 04:17:12PM -0600, Alex Williamson wrote:
> 
> > that much.  I think this would also address Jason's primary concern.
> > It's better to get an IOMMU fault from the user trying to access those
> > mappings than it is to leave them in place.  
> 
> Yes, there are a few options here - if the pages are available for use
> by the IOMMU and *asynchronously* someone else revokes them, then the
> only way to protect the kernel is to block them from the IOMMU.
> 
> For this to be sane the revocation must be under complete control of
> the VFIO user. ie if a user decides to disable MMIO traffic then of
> course the IOMMU should block P2P transfer to the MMIO bar. It is user
> error to have not disabled those transfers in the first place.
> 
> When this is all done inside a guest the whole logic applies. On bare
> metal you might get some AER or crash or MCE. In virtualization you'll
> get an IOMMU fault.
> 
> > due to the memory enable bit.  If we could remap the range to a kernel
> > page we could maybe avoid the IOMMU fault and maybe even have a crude
> > test for whether any data was written to the page while that mapping
> > was in place (ie. simulating more restricted error handling, though
> > more asynchronous than done at the platform level).    
> 
> I'm not sure if this makes sense; can't we arrange to directly trap the
> IOMMU failure and route it into qemu if that is what is desired?

Can't guarantee it, some systems wire that directly into their
management processor so that they can "protect their users" regardless
of whether they want or need it.  Yay firmware first error handling,
*sigh*.  Thanks,

Alex
Peter Xu May 15, 2020, 3:22 p.m. UTC | #5
On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> > I'm not sure if this makes sense; can't we arrange to directly trap the
> > IOMMU failure and route it into qemu if that is what is desired?
> 
> Can't guarantee it, some systems wire that directly into their
> management processor so that they can "protect their users" regardless
> of whether they want or need it.  Yay firmware first error handling,
> *sigh*.  Thanks,

Sorry to be slightly off topic - Alex, does this mean the general approach
of fault reporting from vfio to userspace is not going to work either?

Thanks,
Alex Williamson May 15, 2020, 3:54 p.m. UTC | #6
On Fri, 15 May 2020 11:22:51 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> > > I'm not sure if this makes sense; can't we arrange to directly trap the
> > > IOMMU failure and route it into qemu if that is what is desired?  
> > 
> > Can't guarantee it, some systems wire that directly into their
> > management processor so that they can "protect their users" regardless
> > of whether they want or need it.  Yay firmware first error handling,
> > *sigh*.  Thanks,  
> 
> Sorry to be slightly off topic - Alex, does this mean the general approach
> of fault reporting from vfio to userspace is not going to work either?

AFAIK these platforms only generate a fatal fault on certain classes of
access that imply a potential for data loss, for example a DMA write to
an invalid PTE.  The actual IOMMU page faulting mechanism should
not be affected by this, or at least one would hope.  Thanks,

Alex
Jason Gunthorpe May 20, 2020, 12:23 a.m. UTC | #7
On Thu, May 14, 2020 at 10:51:46AM -0600, Alex Williamson wrote:
> This is a follow-on series to "vfio-pci: Block user access to disabled
> device MMIO"[1], which extends user access blocking of disabled MMIO
> ranges to include unmapping the ranges from the IOMMU.  The first patch
> adds an invalidation callback path, allowing vfio bus drivers to signal
> the IOMMU backend to unmap ranges with vma level granularity.  This
> signaling is done both when the MMIO range becomes inaccessible due to
> memory disabling, as well as when a vma is closed, making up for the
> lack of tracking or pinning for non-page backed vmas.  The second
> patch adds registration and testing interfaces such that the IOMMU
> backend driver can test whether a given PFNMAP vma is provided by a
> vfio bus driver supporting invalidation.  We can then implement more
> restricted semantics to only allow PFNMAP DMA mappings when we have
> such support, which becomes the new default.
> 
> Jason, if you'd like Suggested-by credit for the ideas here I'd be
> glad to add it.  Thanks,

Certainly a Reported-by would be OK

The only thing I don't like here is that this creates a P2P DMA mapping
scheme for VMAs with invalidation that is completely private to vfio.

Many of us want this in other subsystems, and there are legitimate uses
for vfio to import BAR memory for P2P from places other than itself.

So I would really rather this be supported by the core kernel in some
way.

That said, this is a bug fix, and we still don't have much agreement
on what the core kernel version should look like, let alone how it
should work with an IOMMU. 

So maybe this goes ahead as is and we can figure out how to replace it
with something general later on?

Jason
Jason Gunthorpe May 20, 2020, 12:24 a.m. UTC | #8
On Thu, May 14, 2020 at 04:55:17PM -0600, Alex Williamson wrote:
> On Thu, 14 May 2020 19:24:15 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Thu, May 14, 2020 at 04:17:12PM -0600, Alex Williamson wrote:
> > 
> > > that much.  I think this would also address Jason's primary concern.
> > > It's better to get an IOMMU fault from the user trying to access those
> > > mappings than it is to leave them in place.  
> > 
> > Yes, there are a few options here - if the pages are available for use
> > by the IOMMU and *asynchronously* someone else revokes them, then the
> > only way to protect the kernel is to block them from the IOMMU.
> > 
> > For this to be sane the revocation must be under complete control of
> > the VFIO user. ie if a user decides to disable MMIO traffic then of
> > course the IOMMU should block P2P transfer to the MMIO bar. It is user
> > error to have not disabled those transfers in the first place.
> > 
> > When this is all done inside a guest the whole logic applies. On bare
> > metal you might get some AER or crash or MCE. In virtualization you'll
> > get an IOMMU fault.
> > 
> > > due to the memory enable bit.  If we could remap the range to a kernel
> > > page we could maybe avoid the IOMMU fault and maybe even have a crude
> > > test for whether any data was written to the page while that mapping
> > > was in place (ie. simulating more restricted error handling, though
> > > more asynchronous than done at the platform level).    
> > 
> > I'm not sure if this makes sense; can't we arrange to directly trap the
> > IOMMU failure and route it into qemu if that is what is desired?
> 
> Can't guarantee it, some systems wire that directly into their
> management processor so that they can "protect their users" regardless
> of whether they want or need it.  Yay firmware first error handling,
> *sigh*.  Thanks,

I feel like those systems should just lose the ability to reliably
mirror IOMMU errors to their guests - trying to emulate it by scanning
memory etc. sounds too horrible.

Jason