Message ID | 20180807193125.30378-1-alex.williamson@redhat.com (mailing list archive) |
---|---|
Series | Balloon inhibit enhancements, vfio restriction |
On Tue, Aug 07, 2018 at 01:31:21PM -0600, Alex Williamson wrote:
> v3:
>  - Drop "nested" term in commit log (David)
>  - Adopt suggested wording in ccw code (Cornelia)
>  - Explain balloon inhibitor usage in vfio common (Peter)
>  - Fix to call inhibitor prior to re-using existing containers
>    to avoid gap that pinning may have occurred in set container
>    ioctl (self) - Peter, this change is the reason I didn't
>    include your R-b.
>  - Add R-b to patches 1 & 2
>
> v2:
>  - Use atomic ops for balloon inhibit counter (Peter)
>  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
>    default, vfio-pci opt-in by device option, only allowed for mdev
>    devices, no support added for platform as there are no platform
>    mdev devices.
>
> See patch 3/4 for detailed explanation why ballooning and device
> assignment typically don't mix.  If this eventually changes, flags
> on the iommu info struct or perhaps device info struct can inform
> us for automatic opt-in.  Thanks,
>
> Alex

One of the issues with pass-through is that it breaks overcommit
through swap. Ballooning seems to offer one solution, but instead of
making it work, this patch just attempts to block ballooning.

I guess it's better than corrupting memory but I personally find this
approach disappointing.

> Alex Williamson (4):
>   balloon: Allow multiple inhibit users
>   kvm: Use inhibit to prevent ballooning without synchronous mmu
>   vfio: Inhibit ballooning based on group attachment to a container
>   vfio/ccw/pci: Allow devices to opt-in for ballooning
>
>  accel/kvm/kvm-all.c           |  4 +++
>  balloon.c                     | 13 ++++++---
>  hw/vfio/ccw.c                 |  9 +++++++
>  hw/vfio/common.c              | 51 +++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 26 +++++++++++++++++-
>  hw/vfio/trace-events          |  1 +
>  hw/virtio/virtio-balloon.c    |  4 +--
>  include/hw/vfio/vfio-common.h |  2 ++
>  8 files changed, 103 insertions(+), 7 deletions(-)
>
> --
> 2.18.0

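For readers without the patches at hand: the changelog mentions an atomic balloon inhibit counter, and patch 1 is titled "Allow multiple inhibit users". The following is only a rough sketch of what a counted inhibitor could look like, using C11 atomics. The names `balloon_inhibit`, `balloon_is_inhibited` and `balloon_inhibit_count` are made up for illustration and are not claimed to match the code in the series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: number of users currently asking for
 * ballooning to be blocked. Zero-initialized as a static object. */
static atomic_int balloon_inhibit_count;

/* A user takes a reference with state == true and drops it again
 * with state == false. The atomic add lets independent users do this
 * without sharing a lock. */
void balloon_inhibit(bool state)
{
    int old = atomic_fetch_add(&balloon_inhibit_count, state ? 1 : -1);

    /* Dropping a reference that was never taken is a caller bug. */
    assert(state || old > 0);
}

/* Ballooning stays blocked as long as at least one user holds a
 * reference. */
bool balloon_is_inhibited(void)
{
    return atomic_load(&balloon_inhibit_count) > 0;
}
```

With a counter instead of a single flag, one user releasing its inhibit does not re-enable ballooning while another user still needs it blocked.
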
On Tue, 7 Aug 2018 22:44:56 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> One of the issues with pass-through is that it breaks overcommit
> through swap. Ballooning seems to offer one solution, but instead of
> making it work, this patch just attempts to block ballooning.
>
> I guess it's better than corrupting memory but I personally find this
> approach disappointing.

Memory hotplug is the way to achieve variable density with assigned
device VMs, otherwise look towards approaches like mdev and shared
virtual addresses with PASID support.  We cannot shoehorn page faulting
without both hardware and software support.  Some class of "legacy"
device assignment will always have this incompatibility.  Thanks,

Alex

On Tue, Aug 07, 2018 at 01:53:03PM -0600, Alex Williamson wrote:
> Memory hotplug is the way to achieve variable density with assigned
> device VMs, otherwise look towards approaches like mdev and shared
> virtual addresses with PASID support.  We cannot shoehorn page faulting
> without both hardware and software support.  Some class of "legacy"
> device assignment will always have this incompatibility.  Thanks,
>
> Alex

I'm not sure I agree.

At least with VTD, it seems entirely possible to change e.g. a PMD
atomically to point to a different set of PTEs, then flush.
That will allow removing memory at high granularity for
an arbitrary device without mdev or PASID dependency.

I suspect most IOMMUs are like this.

IIUC doing that within guest right now will cause a range to be unmapped
and then mapped again, which I suspect only works if we are lucky and
the device does not access the range during this time.

So at some level it's a theoretical bug we would do well to fix,
and then we can support ballooning better.

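The difference between an atomic PMD switch and an unmap followed by a map can be shown with a toy model. This is a single-threaded C sketch only. The types `pte_table` and `pmd_entry` and the functions `translate`, `pmd_switch`, `pmd_unmap` and `pmd_map` are hypothetical and do not correspond to VT-d structures or to any kernel or QEMU API. A return value of 0 stands in for a translation fault, and the IOTLB flush is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_TABLE 512

/* Toy leaf table: each entry holds a fake target address. */
typedef struct {
    uint64_t pte[PTES_PER_TABLE];
} pte_table;

/* Toy directory entry: one slot pointing at a leaf table. */
typedef struct {
    _Atomic(pte_table *) slot;
} pmd_entry;

/* What a device access would see; 0 models a translation fault. */
uint64_t translate(pmd_entry *pmd, size_t idx)
{
    pte_table *t = atomic_load(&pmd->slot);

    return t ? t->pte[idx % PTES_PER_TABLE] : 0;
}

/* Atomic switch: a concurrent walker sees the old table or the new
 * one, never an empty slot. The old table is returned so the caller
 * can release it once stale translations have been flushed. */
pte_table *pmd_switch(pmd_entry *pmd, pte_table *new_table)
{
    return atomic_exchange(&pmd->slot, new_table);
}

/* The map/unmap-only alternative: between these two calls the slot
 * is empty and any access in that window faults. */
void pmd_unmap(pmd_entry *pmd)
{
    atomic_store(&pmd->slot, (pte_table *)NULL);
}

void pmd_map(pmd_entry *pmd, pte_table *t)
{
    assert(t != NULL);
    atomic_store(&pmd->slot, t);
}
```

In this model `pmd_switch` never exposes an empty slot, while the `pmd_unmap` and `pmd_map` pair does, which is the window a device access could hit.
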
On Wed, 8 Aug 2018 00:58:32 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> I'm not sure I agree.
>
> At least with VTD, it seems entirely possible to change e.g. a PMD
> atomically to point to a different set of PTEs, then flush.
> That will allow removing memory at high granularity for
> an arbitrary device without mdev or PASID dependency.
>
> I suspect most IOMMUs are like this.
>
> IIUC doing that within guest right now will cause a range to be unmapped
> and then mapped again, which I suspect only works if we are lucky and
> the device does not access the range during this time.
>
> So at some level it's a theoretical bug we would do well to fix,
> and then we can support ballooning better.

Being able to unmap the page atomically from the IOMMU is one aspect,
the other is re-mapping the page when the balloon is deflated, which is
currently done only via a page fault.  We cannot guarantee that a vCPU
will touch a page before the IO device does, so something needs to
fault in that page for the IOMMU.  So we have:

 - How do we handle re-mapping pages as the balloon is deflated?
   - IOMMU page faults?  Requires PRI, IOMMU & endpoint support.
   - Some new MMU notifier hook?  Not sure WILLNEED is appropriate here.
 - How do we handle un-mapping pages as the balloon is inflated?
   - Rewrite the kernel IOMMU API and IOMMU drivers to allow
     unmapping sub-pages within previous mappings.
   - MMU notifier hook to trigger above non-existent code?
   - Alternatively, sacrificing IOTLB performance and probably kernel
     bloat by using only PAGE_SIZE IOMMU mappings.

Maybe some of these will evolve over time, SVA efforts are working on
some of these interfaces, but apparently device assignment users have
been getting along just fine without ballooning for many years.  With
physical devices, or even modern VFs, it seems hard to push density
beyond what we can handle with memory hotplug.  Perhaps as we get into
scalable IOV type approaches we can opt-in more mediated devices by
default.  It seems like we're just going around in circles here though,
anything more than preventing QEMU from shooting itself is a long term
goal touching multiple levels of the stack.  Thanks,

Alex

On Tue, Aug 07, 2018 at 04:40:33PM -0600, Alex Williamson wrote:
> Maybe some of these will evolve over time, SVA efforts are working on
> some of these interfaces, but apparently device assignment users have
> been getting along just fine without ballooning for many years.

But not any more I think. It takes all the running you can do, to keep
in the same place.  Overcommit with device specific drivers is one of
the things that containers do better than VMs. If VMs had a better
overcommit story with PT devices, it would be interesting IMHO.

> With
> physical devices, or even modern VFs, it seems hard to push density
> beyond what we can handle with memory hotplug.  Perhaps as we get into
> scalable IOV type approaches we can opt-in more mediated devices by
> default.

I'm not sure what mediated has to do with it though.  It seems weird
to fix internal Linux or even system call interfaces being inadequate
with custom hardware.

> It seems like we're just going around in circles here though,
> anything more than preventing QEMU from shooting itself is a long term
> goal touching multiple levels of the stack.

It's just QEMU and the kernel, I don't see why any other levels would be
involved. And it looks like we both agree it is a bug in the current VTD
emulation even though current guests do not trigger it.

I agree it's more work than just blocking things out, I am not making an
argument for nacking this specific patch, but I do hope this thread
motivates someone to look into it.

On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> At least with VTD, it seems entirely possible to change e.g. a PMD
> atomically to point to a different set of PTEs, then flush.
> That will allow removing memory at high granularity for
> an arbitrary device without mdev or PASID dependency.

My understanding is that the guest driver should prohibit this kind of
operation (say, modifying PMD).  Actually I don't see how it can
happen in Linux if the kernel drivers always call the IOMMU API since
there are only map/unmap APIs rather than this atomic-modify API.

The thing is that IMHO it's the guest driver's responsibility to make
sure the pages will never be used by the device before it removes the
entry (including modifying the PMD since that actually removes all the
entries on the old PMD).  If not, I would see it as a guest kernel bug
rather than a bug in the emulation code.

Thanks,

On Wed, 8 Aug 2018 11:45:43 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> > At least with VTD, it seems entirely possible to change e.g. a PMD
> > atomically to point to a different set of PTEs, then flush.
> > That will allow removing memory at high granularity for
> > an arbitrary device without mdev or PASID dependency.
>
> My understanding is that the guest driver should prohibit this kind of
> operation (say, modifying PMD).

There's currently no need for this sort of operation within the dma api
and the iommu api doesn't offer it either.

> Actually I don't see how it can
> happen in Linux if the kernel drivers always call the IOMMU API since
> there are only map/unmap APIs rather than this atomic-modify API.

Exactly, the vfio dma mapping api is just an extension of the iommu api
and there's only map and unmap.  Furthermore, unmap can currently
return more than requested if the original mapping made use of
superpages in the iommu, so the only way to achieve page level
granularity is to make only page size mappings.  Otherwise we're
talking about new apis across the board.

> The thing is that IMHO it's the guest driver's responsibility to make
> sure the pages will never be used by the device before it removes the
> entry (including modifying the PMD since that actually removes all the
> entries on the old PMD).  If not, I would see it as a guest kernel bug
> rather than a bug in the emulation code.

This is why there is no atomic modify in the dma api, we have drivers
that directly manage the buffers for a device and know when it's in use
and when it's not.  There's never a need, currently, to replace the
iova mapping for a single page within a larger buffer.  Maybe the dma
api could also find use for it, but it seems more unique to the iommu
api that we have a "buffer", which happens to be a contiguous RAM
region for the VM, where we do want to change the mapping of a single
page.  That single page might currently be mapped by a 2MB or 1GB page
in the case of Intel, or by an arbitrary page size in the case of AMD.
vfio is the driver managing these mappings, but versus the dma api, we
don't have any insight into the device behavior, including inflight
dma.  We can stop all dma for the device, but not without interfering
and potentially breaking the behavior of the device.

So again, I think this comes down to new iommu driver support and new
iommu apis and new vfio apis to enable some sort of atomic update
interface, or sacrificing performance and adding bloat by forcing page
size mappings.  Thanks,

Alex

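The point about unmap returning more than requested can be illustrated with simple alignment arithmetic. This is a toy calculation only: `toy_unmap_span`, `PAGE_4K` and `PAGE_2M` are hypothetical names, and the model assumes that unmapping always removes whole pages of whatever size backed the original mapping. It does not reproduce the behavior of any particular IOMMU driver.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K 0x1000ULL
#define PAGE_2M 0x200000ULL

/* Toy model: given a request to unmap [iova, iova + len) from a region
 * mapped with pages of page_size bytes, compute the span that is
 * actually removed. The request is widened to whole pages: the start
 * is rounded down and the end is rounded up. Returns the number of
 * bytes removed and stores the first removed address in *start_out. */
uint64_t toy_unmap_span(uint64_t iova, uint64_t len, uint64_t page_size,
                        uint64_t *start_out)
{
    /* page_size must be a non-zero power of two. */
    assert(len > 0 && page_size > 0 &&
           (page_size & (page_size - 1)) == 0);

    uint64_t start = iova & ~(page_size - 1);
    uint64_t end = (iova + len + page_size - 1) & ~(page_size - 1);

    *start_out = start;
    return end - start;
}
```

Under these assumptions, asking to remove one 4KB page at 0x201000 from a region backed by a 2MB page takes out the whole 2MB starting at 0x200000, while the same request against 4KB mappings removes only the page that was asked for. That is the trade-off between superpage mappings and page level granularity described above.
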
On Wed, Aug 08, 2018 at 04:23:04PM -0600, Alex Williamson wrote:
> So again, I think this comes down to new iommu driver support and new
> iommu apis and new vfio apis to enable some sort of atomic update
> interface,

Oh absolutely. My point is some guest OS can start using atomic updates
at any time since it's something IOMMU hardware supports.
Adherence to a hardware spec would be preferable to adherence to an
internal Linux API. I appreciate it's not an easy task involving host
Linux and QEMU changes.

On Wed, Aug 08, 2018 at 11:45:43AM +0800, Peter Xu wrote:
> On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> > At least with VTD, it seems entirely possible to change e.g. a PMD
> > atomically to point to a different set of PTEs, then flush.
> > That will allow removing memory at high granularity for
> > an arbitrary device without mdev or PASID dependency.
>
> My understanding is that the guest driver should prohibit this kind of
> operation (say, modifying PMD).

Interesting. Which part of the VTD spec prohibits this?

> Actually I don't see how it can
> happen in Linux if the kernel drivers always call the IOMMU API since
> there are only map/unmap APIs rather than this atomic-modify API.

It could happen with a non-Linux guest which might have a different API.

> The thing is that IMHO it's the guest driver's responsibility to make
> sure the pages will never be used by the device before it removes the
> entry (including modifying the PMD since that actually removes all the
> entries on the old PMD).

If you switch PMDs atomically from one set of valid PTEs to another,
then flush, then as far as I could see it just works in the hardware
VTD, but not in the emulated VTD.  So that's a difference in
behaviour. Maybe we are lucky and no one does that.

> If not, I would see it as a guest kernel bug
> rather than a bug in the emulation code.
>
> Thanks,
>
> --
> Peter Xu

On Thu, Aug 09, 2018 at 12:23:43PM +0300, Michael S. Tsirkin wrote:
> If you switch PMDs atomically from one set of valid PTEs to another,
> then flush, then as far as I could see it just works in the hardware
> VTD, but not in the emulated VTD.  So that's a difference in
> behaviour. Maybe we are lucky and no one does that.

Yes, but AFAICT that's also the best we can have now since the
userspace QEMU (or say, the VT-d emulation code) cannot really modify
a real PMD that the hardware uses - it can only call the VFIO APIs,
and finally it boils down again to the host kernel IOMMU APIs to do
map or unmap only.  So it's an impossible task until we provide such an
interface through the whole IOMMU/VFIO/... stack just like what you
have discussed in the other thread.

Thanks,

On Thu, Aug 09, 2018 at 05:37:58PM +0800, Peter Xu wrote:
> Yes, but AFAICT that's also the best we can have now since the
> userspace QEMU (or say, the VT-d emulation code) cannot really modify
> a real PMD that the hardware uses - it can only call the VFIO APIs,
> and finally it boils down again to the host kernel IOMMU APIs to do
> map or unmap only.  So it's an impossible task until we provide such an
> interface through the whole IOMMU/VFIO/... stack just like what you
> have discussed in the other thread.
>
> Thanks,

This would need host kernel support, yes.