
[RFC] Fix superpage unmap on Intel IOMMU

Message ID 1307129491.23418.4.camel@i7.infradead.org (mailing list archive)
State New, archived

Commit Message

David Woodhouse June 3, 2011, 7:31 p.m. UTC
Tell me it isn't so...

Comments

David Woodhouse June 3, 2011, 8 p.m. UTC | #1
On Fri, 2011-06-03 at 20:31 +0100, David Woodhouse wrote:
> 
> +               size = PAGE_SIZE << gfp_order; 

Just realised it needs an extra <<9 in there somewhere, if you actually
wanted to test it (Allen). I think I'd prefer to fix the KVM code
instead, though.
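
For reference, each VT-d page-table level covers 9 more address bits
(512 entries per table), so converting a superpage level into a 4KiB
page order would look roughly like this (illustrative only, not the
final fix):

	/* Illustrative: a level-N superpage covers 4KiB << ((N - 1) * 9),
	   i.e. 2MiB at level 2 and 1GiB at level 3. */
	gfp_order = (level - 1) * 9;
	size = PAGE_SIZE << gfp_order;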
Alex Williamson June 3, 2011, 8:01 p.m. UTC | #2
On Fri, 2011-06-03 at 20:31 +0100, David Woodhouse wrote:
> Tell me it isn't so...

Looks accurate to me, in fact, with hugetlbfs it seems like it's doing
exactly what it should do.  The non-hugetlbfs case isn't efficient, but
it isn't wrong either.  Our only other option is to figure out what's
contiguous, map those chunks and try to keep track of all that.  It's
much easier to make you do it since you have to manage the iommu page
table anyway ;)  Besides, it's the only way we get page granularity on
unmapping.

And of course, isn't this just following the IOMMU API?  It'd be nice if
that API was documented somewhere.  Thanks,

Alex

> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index 59f17ac..6c588ee 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -3866,6 +3866,20 @@ static int intel_iommu_unmap(struct iommu_domain *domain,
>  	struct dmar_domain *dmar_domain = domain->priv;
>  	size_t size = PAGE_SIZE << gfp_order;
>  
> +	/* The KVM code is *fucked* in the head. It maps the range
> +	   one page at a time, using 4KiB pages unless it actually
> +	   allocated hugepages using hugetlbfs. (So we get to flush
> +	   the CPU data cache and then the IOTLB for each page in
> +	   its loop). And on unmap, it unmaps 4KiB at a time (always
> +	   passing gfp_order==0), regardless of whether it mapped
> +	   using superpages or not. So on unmap, if we detect a
> +	   superpage in our page tables we are expected to unmap
> +	   *more* than we are asked to, and return a value indicating
> +	   how much we actually unmapped. WTF? */
> +	if (dma_pfn_level_pte (dmar_domain, iova >> VTD_PAGE_SHIFT,
> +			       1, &gfp_order))
> +		size = PAGE_SIZE << gfp_order;
> +	    
>  	dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
>  			    (iova + size - 1) >> VTD_PAGE_SHIFT);
>  
> 



David Woodhouse June 4, 2011, 9 a.m. UTC | #3
On Fri, 3 Jun 2011, Alex Williamson wrote:

> On Fri, 2011-06-03 at 20:31 +0100, David Woodhouse wrote:
> > Tell me it isn't so...
> 
> Looks accurate to me, in fact, with hugetlbfs it seems like it's doing
> exactly what it should do.  The non-hugetlbfs case isn't efficient, but
> it isn't wrong either.  Our only other option is to figure out what's
> contiguous, map those chunks and try to keep track of all that. 

That's not hard. You're already iterating over the pages. Instead of 
unconditionally mapping one page at a time, you have simple logic in the 
loop to do something like:
	if (this_hpa == first_hpa + nr_pages * page_size)
		nr_pages++;
	else {
		iommu_map(first_hpa, nr_pages);
		first_hpa = this_hpa;
		nr_pages = 1;
	}

This gives you a fairly simple way to spot contiguous ranges, avoid *some* 
of the extra cache and IOTLB flushes, and perhaps give us a chance to 
*opportunistically* use superpages. (You could avoid all the extra flushes 
if you give us a sglist, but that's probably more hassle than it's worth.)
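
A slightly fuller, purely illustrative version of that loop might look
like the one below. hpa_of() and map_run() are hypothetical stand-ins
for the guest-to-host lookup and the eventual iommu_map() call(s); with
the current gfp_order-based API, map_run() would still have to split a
run into naturally aligned power-of-two chunks:

	/* Sketch only: coalesce physically contiguous pages into runs before
	   handing them to the IOMMU. hpa_of() and map_run() are hypothetical
	   helpers, not existing kernel functions. */
	static void map_coalesced(struct iommu_domain *domain,
				  unsigned long iova, unsigned long npages)
	{
		unsigned long first_iova = iova, first_hpa = 0, nr_pages = 0;

		for (; npages; npages--, iova += PAGE_SIZE) {
			unsigned long this_hpa = hpa_of(iova);	/* hypothetical lookup */

			if (nr_pages &&
			    this_hpa == first_hpa + nr_pages * PAGE_SIZE) {
				nr_pages++;		/* run continues */
				continue;
			}
			if (nr_pages)			/* emit the previous run */
				map_run(domain, first_iova, first_hpa, nr_pages);
			first_iova = iova;
			first_hpa = this_hpa;
			nr_pages = 1;
		}
		if (nr_pages)				/* emit the final run */
			map_run(domain, first_iova, first_hpa, nr_pages);
	}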

Of course, if we do opportunistically use superpages, we're going to have 
to be able to break them. Currently your API just says "unmap whatever 
page size you happen to find in the PTE here, and tell me what it was", 
which will hurt if you really only mean to unmap 4KiB for ballooning, when 
we'd spotted that we could map a whole 2MiB page there.

(Unless the IOMMU keeps track *separately* of the page size used when 
mapping each range, somewhere outside the page tables themselves. But no. 
You *have* that information, in kvm_host_page_size(vma). So give it.)
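
As a hedged sketch of what the unmap side could look like if KVM passed
that size along: kvm_host_page_size() is the existing helper the map
path already uses, while the loop shape and kvm_unpin_pages() are
hypothetical:

	/* Illustrative only: recompute the host page size on unmap and pass a
	   matching order, instead of always passing gfp_order == 0. */
	while (gfn < end_gfn) {
		unsigned long page_size = kvm_host_page_size(kvm, gfn);
		int order = get_order(page_size);

		iommu_unmap(domain, gfn_to_gpa(gfn), order);
		/* kvm_unpin_pages() is a hypothetical helper that releases the
		   corresponding pinned host pages. */
		kvm_unpin_pages(kvm, gfn, 1UL << order);
		gfn += 1UL << order;
	}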
Joerg Roedel June 6, 2011, 9:50 a.m. UTC | #4
On Fri, Jun 03, 2011 at 03:31:30PM -0400, David Woodhouse wrote:

> +	/* The KVM code is *fucked* in the head. It maps the range
> +	   one page at a time, using 4KiB pages unless it actually
> +	   allocated hugepages using hugetlbfs.

This is actually by design. The IOMMU driver is not responsible for
finding out details like the contiguity of the memory it has to map. It
just gets the physical address and the size and maps it into its
page-tables. Any details beyond that are the business of KVM, and since
KVM allocates guest memory in user-space the only way to find out the
size of the physically contiguous region is the page-size of the vma.

> +	   (So we get to flush the CPU data cache and then
> +	    the IOTLB for each page in its loop).

The best way to avoid this is to introduce some kind of
iommu_domain_commit function which KVM calls at the end and which just
does a single iotlb flush (and a dcache flush on VT-d).
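
No such hook exists in the IOMMU API today; a rough sketch of the idea,
using the single global iommu_ops pointer that drivers/base/iommu.c
keeps at the moment and a hypothetical ->commit callback, might be:

	/* Hypothetical batching hook; nothing like this exists yet. A driver
	   would defer its dcache/IOTLB flushes and perform them once here. */
	void iommu_domain_commit(struct iommu_domain *domain)
	{
		if (iommu_ops->commit)
			iommu_ops->commit(domain);
	}

	/* KVM would then call iommu_domain_commit(domain) once, after its
	   whole map or unmap loop, instead of the driver flushing per page. */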

This is at least better than moving more functionality into the IOMMU
driver (for example by teaching the driver to map a vma region). The
mapping code should stay common between IOMMU drivers as it is now.

> +	   And on unmap, it unmaps 4KiB at a time (always
> +	   passing gfp_order==0), regardless of whether it mapped
> +	   using superpages or not. So on unmap, if we detect a
> +	   superpage in our page tables we are expected to unmap
> +	   *more* than we are asked to, and return a value indicating
> +	   how much we actually unmapped. WTF? */

That is actually a weak point, I agree. The interface isn't very
consistent here. The idea was that KVM does not need to care which
page-size was used to map.

Probably we can clean that interface up and allow the ->unmap call to
only unmap the requested size, forcing it to split a superpage if
needed. That would make the interface a lot cleaner and more
consistent, but the KVM IOMMU unmap path a lot more costly.

But all KVM wants to do is unmap everything from the io-page-table
and unpin the pages. We could hide the unmapping part in IOMMU domain
destruction and have KVM only unpin its pages.

Thoughts?

Regards,

	Joerg

Patch

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 59f17ac..6c588ee 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -3866,6 +3866,20 @@ static int intel_iommu_unmap(struct iommu_domain *domain,
 	struct dmar_domain *dmar_domain = domain->priv;
 	size_t size = PAGE_SIZE << gfp_order;
 
+	/* The KVM code is *fucked* in the head. It maps the range
+	   one page at a time, using 4KiB pages unless it actually
+	   allocated hugepages using hugetlbfs. (So we get to flush
+	   the CPU data cache and then the IOTLB for each page in
+	   its loop). And on unmap, it unmaps 4KiB at a time (always
+	   passing gfp_order==0), regardless of whether it mapped
+	   using superpages or not. So on unmap, if we detect a
+	   superpage in our page tables we are expected to unmap
+	   *more* than we are asked to, and return a value indicating
+	   how much we actually unmapped. WTF? */
+	if (dma_pfn_level_pte (dmar_domain, iova >> VTD_PAGE_SHIFT,
+			       1, &gfp_order))
+		size = PAGE_SIZE << gfp_order;
+	    
 	dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
 			    (iova + size - 1) >> VTD_PAGE_SHIFT);