Message ID | 1307129491.23418.4.camel@i7.infradead.org (mailing list archive) |
---|---|
State | New, archived |
On Fri, 2011-06-03 at 20:31 +0100, David Woodhouse wrote:
> > +	size = PAGE_SIZE << gfp_order;

Just realised it needs an extra <<9 in there somewhere, if you actually
wanted to test it (Allen). I think I'd prefer to fix the KVM code
instead, though.
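For illustration, a minimal sketch of the size computation David seems to be alluding to, assuming the helper reports the page-table level of the PTE it found and that each VT-d level covers 9 more address bits. The macro values are the usual VT-d ones, but the helper convention here is an assumption, not the actual fix:

	#include <stddef.h>

	#define VTD_PAGE_SHIFT   12
	#define VTD_STRIDE_SHIFT 9	/* each VT-d page-table level adds 9 bits */

	/*
	 * Sketch only: size covered by a PTE found at a given page-table
	 * level (level 1 = 4KiB, level 2 = 2MiB, level 3 = 1GiB).  The
	 * "extra <<9" per level is what is missing from the snippet above,
	 * which only shifts PAGE_SIZE by the raw level/order value.
	 */
	static size_t pte_level_size(int level)
	{
		return (size_t)1 << (VTD_PAGE_SHIFT + (level - 1) * VTD_STRIDE_SHIFT);
	}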
On Fri, 2011-06-03 at 20:31 +0100, David Woodhouse wrote:
> Tell me it isn't so...

Looks accurate to me, in fact, with hugetlbfs it seems like it's doing
exactly what it should do. The non-hugetlbfs case isn't efficient, but
it isn't wrong either. Our only other option is to figure out what's
contiguous, map those chunks and try to keep track of all that. It's
much easier to make you do it since you have to manage the iommu page
table anyway ;) Besides, it's the only way we get page granularity on
unmapping. And of course, isn't this just following the IOMMU API?
It'd be nice if that API was documented somewhere.

Thanks,

Alex

> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index 59f17ac..6c588ee 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -3866,6 +3866,20 @@ static int intel_iommu_unmap(struct iommu_domain *domain,
>  	struct dmar_domain *dmar_domain = domain->priv;
>  	size_t size = PAGE_SIZE << gfp_order;
>
> +	/* The KVM code is *fucked* in the head. It maps the range
> +	   one page at a time, using 4KiB pages unless it actually
> +	   allocated hugepages using hugetlbfs. (So we get to flush
> +	   the CPU data cache and then the IOTLB for each page in
> +	   its loop). And on unmap, it unmaps 4KiB at a time (always
> +	   passing gfp_order==0), regardless of whether it mapped
> +	   using superpages or not. So on unmap, if we detect a
> +	   superpage in our page tables we are expected to unmap
> +	   *more* than we are asked to, and return a value indicating
> +	   how much we actually unmapped. WTF? */
> +	if (dma_pfn_level_pte (dmar_domain, iova >> VTD_PAGE_SHIFT,
> +			       1, &gfp_order))
> +		size = PAGE_SIZE << gfp_order;
> +
>  	dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
>  			    (iova + size - 1) >> VTD_PAGE_SHIFT);
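For context, the order-based map/unmap interface Alex is referring to looked roughly like this at the time; treat the exact prototypes as an assumption reconstructed from the era's include/linux/iommu.h rather than a quotation of it:

	/* Sizes are expressed as an order relative to PAGE_SIZE, and
	 * iommu_unmap() returns the order it actually unmapped - which is
	 * exactly what makes superpages awkward in this thread. */
	int iommu_map(struct iommu_domain *domain, unsigned long iova,
		      phys_addr_t paddr, int gfp_order, int prot);
	int iommu_unmap(struct iommu_domain *domain, unsigned long iova,
			int gfp_order);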
On Fri, 3 Jun 2011, Alex Williamson wrote:
> On Fri, 2011-06-03 at 20:31 +0100, David Woodhouse wrote:
> > Tell me it isn't so...
>
> Looks accurate to me, in fact, with hugetlbfs it seems like it's doing
> exactly what it should do. The non-hugetlbfs case isn't efficient, but
> it isn't wrong either. Our only other option is to figure out what's
> contiguous, map those chunks and try to keep track of all that.

That's not hard. You're already iterating over the pages. Instead of
unconditionally mapping one page at a time, you have simple logic in
the loop to do something like:

	if (this_hpa == first_hpa + nr_pages * page_size)
		nr_pages++;
	else {
		iommu_map(first_hpa, nr_pages);
		first_hpa = this_hpa;
		nr_pages = 1;
	}

This gives you a fairly simple way to spot contiguous ranges, avoid
*some* of the extra cache and IOTLB flushes, and perhaps give us a
chance to *opportunistically* use superpages. (You could avoid all the
extra flushes if you give us a sglist, but that's probably more hassle
than it's worth.)

Of course, if we do opportunistically use superpages, we're going to
have to be able to break them. Currently your API just says "unmap
whatever page size you happen to find in the PTE here, and tell me what
it was", which will hurt if you really only mean to unmap 4KiB for
ballooning, when we'd spotted that we could map a whole 2MiB page
there.

(Unless the IOMMU keeps track *separately* of the page size used when
mapping each range, somewhere outside the page tables themselves. But
no. You *have* that information, in kvm_host_page_size(vma). So give
it.)
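As a self-contained illustration of the loop David is sketching, something along these lines would batch physically contiguous pages into a single map call. This is a sketch with made-up types and a placeholder map_range() helper, not the real KVM mapping path:

	#include <stddef.h>
	#include <stdint.h>

	/* Placeholder for whatever actually programs the IOMMU page tables. */
	static void map_range(uint64_t iova, uint64_t hpa, size_t npages)
	{
		/* ... */
	}

	/*
	 * Contiguous-run detection: walk the per-page host physical
	 * addresses and only call map_range() when a run of physically
	 * contiguous pages ends, instead of mapping one page at a time.
	 */
	static void map_contiguous(uint64_t iova, const uint64_t *hpa,
				   size_t npages, size_t page_size)
	{
		size_t i, run_start = 0;

		for (i = 1; i <= npages; i++) {
			/* Extend the run while the next page is contiguous. */
			if (i < npages &&
			    hpa[i] == hpa[run_start] + (i - run_start) * page_size)
				continue;

			/* Run ended: map it in one go, then start a new run. */
			map_range(iova + run_start * page_size, hpa[run_start],
				  i - run_start);
			run_start = i;
		}
	}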
On Fri, Jun 03, 2011 at 03:31:30PM -0400, David Woodhouse wrote:
> +	/* The KVM code is *fucked* in the head. It maps the range
> +	   one page at a time, using 4KiB pages unless it actually
> +	   allocated hugepages using hugetlbfs.

This is actually by design. The IOMMU driver is not responsible for
finding out details like the continuity of the memory it has to map. It
just gets the physical address and the size and maps it into its
page-tables. Any details beside that are the business of KVM, and since
KVM allocates guest memory in user-space the only way to find out the
size of the physically contiguous region is the page-size of the vma.

> +	   (So we get to flush the CPU data cache and then
> +	   the IOTLB for each page in its loop).

The best way to avoid this is to introduce some kind of
iommu_domain_commit function which KVM calls in the end and which just
does a single IOTLB flush (and a dcache flush on VT-d). This is at
least better than moving more functionality into the IOMMU driver (for
example by teaching the driver to map a vma region). The mapping code
should stay common between IOMMU drivers as it is now.

> +	   And on unmap, it unmaps 4KiB at a time (always
> +	   passing gfp_order==0), regardless of whether it mapped
> +	   using superpages or not. So on unmap, if we detect a
> +	   superpage in our page tables we are expected to unmap
> +	   *more* than we are asked to, and return a value indicating
> +	   how much we actually unmapped. WTF? */

That is actually a weak point, I agree. The interface isn't very
consistent here. The idea was that KVM does not need to care which
page-size was used to map. Probably we can clean that interface up and
allow the ->unmap call to only unmap the requested size, forcing to
split a page if needed. That makes the interface a lot cleaner and
consistent, but the KVM IOMMU unmap path a lot more costly.

But all KVM wants to do is to unmap everything from the io-page-table
and unpin the pages. We can hide the unmapping part in IOMMU domain
destruction and KVM only unpins its pages.

Thoughts?

Regards,

	Joerg
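To make the iommu_domain_commit idea concrete, here is a hypothetical sketch of what such a hook might look like. Nothing below exists in the kernel; the name, the ops structure and the caller-side usage are all assumptions made purely for illustration:

	/*
	 * Hypothetical batch-flush hook along the lines Joerg suggests: the
	 * driver defers its dcache/IOTLB flushes while the caller maps, and
	 * performs them once when the caller commits.
	 */
	struct iommu_domain;			/* opaque, as in include/linux/iommu.h */

	struct iommu_batch_ops {
		void (*commit)(struct iommu_domain *domain);	/* one flush here */
	};

	static inline void iommu_domain_commit(struct iommu_domain *domain,
					       const struct iommu_batch_ops *ops)
	{
		if (ops && ops->commit)
			ops->commit(domain);
	}

	/*
	 * Caller side (also hypothetical): map a range page by page, then do
	 * a single commit instead of a flush per page:
	 *
	 *	for (i = 0; i < npages; i++)
	 *		iommu_map(domain, iova + i * PAGE_SIZE, hpa[i], 0, prot);
	 *	iommu_domain_commit(domain, ops);
	 */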
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 59f17ac..6c588ee 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -3866,6 +3866,20 @@ static int intel_iommu_unmap(struct iommu_domain *domain,
 	struct dmar_domain *dmar_domain = domain->priv;
 	size_t size = PAGE_SIZE << gfp_order;
 
+	/* The KVM code is *fucked* in the head. It maps the range
+	   one page at a time, using 4KiB pages unless it actually
+	   allocated hugepages using hugetlbfs. (So we get to flush
+	   the CPU data cache and then the IOTLB for each page in
+	   its loop). And on unmap, it unmaps 4KiB at a time (always
+	   passing gfp_order==0), regardless of whether it mapped
+	   using superpages or not. So on unmap, if we detect a
+	   superpage in our page tables we are expected to unmap
+	   *more* than we are asked to, and return a value indicating
+	   how much we actually unmapped. WTF? */
+	if (dma_pfn_level_pte (dmar_domain, iova >> VTD_PAGE_SHIFT,
+			       1, &gfp_order))
+		size = PAGE_SIZE << gfp_order;
+
 	dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
 			    (iova + size - 1) >> VTD_PAGE_SHIFT);
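From the caller's point of view, the effect of the patch is roughly the following. This is a sketch of intent only, not actual KVM code, and it leans on the order-based unmap behaviour described in the thread:

	/* Ask to unmap a single 4KiB page at this IOVA ... */
	int order = iommu_unmap(domain, iova, 0);

	/* ... but if a superpage was found there, the driver tears down the
	 * whole thing and reports how much it really unmapped. */
	size_t unmapped = PAGE_SIZE << order;	/* may turn out to be 2MiB or 1GiB */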