| Message ID | 20210203233709.19819-6-dongli.zhang@oracle.com (mailing list archive) |
| ---------- | ----------------------------------------------------------------------- |
| State      | New, archived                                                             |
| Series     | swiotlb: 64-bit DMA buffer                                                |

So one thing that has been on my mind for a while: I'd really like
to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb
to swiotlb the main difference seems to be:

 - additional reasons to bounce I/O vs the plain DMA capable
 - the possibility to do a hypercall on arm/arm64
 - an extra translation layer before doing the phys_to_dma and vice
   versa
 - a special memory allocator

I wonder if in between a few jump labels or other no overhead enablement
options and possibly better use of the dma_range_map we could kill
off most of swiotlb-xen instead of maintaining all this code duplication?
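To make the "jump labels or other no overhead enablement options" idea a bit more
concrete, here is a minimal sketch of what such a hook could look like. This is an
illustration only, not a proposed patch: the xen_dma_translate key and the example_*
helpers are made-up names; only DEFINE_STATIC_KEY_FALSE/static_branch_unlikely,
pfn_to_bfn() and phys_to_dma() are existing kernel interfaces.

/*
 * Hypothetical sketch: a static key that stays off in the common case and is
 * enabled once by Xen setup code (static_branch_enable()), so the extra
 * PFN->MFN translation costs nothing when it is not needed.
 */
#include <linux/jump_label.h>
#include <linux/dma-direct.h>
#include <linux/pfn.h>
#include <xen/page.h>		/* pfn_to_bfn() */

DEFINE_STATIC_KEY_FALSE(xen_dma_translate);

static inline phys_addr_t example_xen_translate(phys_addr_t paddr)
{
	/* Look up the machine frame backing this pseudo-physical frame. */
	return ((phys_addr_t)pfn_to_bfn(PFN_DOWN(paddr)) << PAGE_SHIFT) |
	       (paddr & ~PAGE_MASK);
}

static inline dma_addr_t example_phys_to_dma(struct device *dev, phys_addr_t paddr)
{
	if (static_branch_unlikely(&xen_dma_translate))
		paddr = example_xen_translate(paddr);
	/* Fall through to the ordinary direct-mapping conversion. */
	return phys_to_dma(dev, paddr);
}

Note that this sketch glosses over Xen's 4K frame granularity (pfn_to_bfn() operates
on XEN_PFN-sized frames), which a real implementation would have to handle.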
On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote:
> So one thing that has been on my mind for a while: I'd really like
> to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb
> to swiotlb the main difference seems to be:
>
>  - additional reasons to bounce I/O vs the plain DMA capable
>  - the possibility to do a hypercall on arm/arm64
>  - an extra translation layer before doing the phys_to_dma and vice
>    versa
>  - a special memory allocator
>
> I wonder if in between a few jump labels or other no overhead enablement
> options and possibly better use of the dma_range_map we could kill
> off most of swiotlb-xen instead of maintaining all this code duplication?

So I looked at this a bit more.

For x86 with XENFEAT_auto_translated_physmap (how common is that?)
pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is.

xen_arch_need_swiotlb always returns true for x86, and
range_straddles_page_boundary should never be true for the
XENFEAT_auto_translated_physmap case.

So as far as I can tell the mapping fast path for the
XENFEAT_auto_translated_physmap can be trivially reused from swiotlb.

That leaves us with the next more complicated case, x86 or fully cache
coherent arm{,64} without XENFEAT_auto_translated_physmap. In that case
we need to patch in a phys_to_dma/dma_to_phys that performs the MFN
lookup, which could be done using alternatives or jump labels.
I think if that is done right we should also be able to let that cover
the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but
in that worst case that would need another alternative / jump label.

For non-coherent arm{,64} we'd also need to use alternatives or jump
labels for the cache maintenance ops, but that isn't a hard problem
either.
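For the is_xen_swiotlb_buffer/is_swiotlb_buffer point above, the foreign-page case
could in principle hide behind the same kind of static key. The sketch below reuses
the hypothetical xen_dma_translate key and example_* naming from the earlier sketch;
it only illustrates the shape of the check (is_swiotlb_buffer() is shown with the
single-argument signature it had at the time of this thread):

#include <linux/dma-direct.h>
#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/swiotlb.h>

static bool example_is_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
{
	phys_addr_t paddr = dma_to_phys(dev, dma_addr);

	/*
	 * A foreign page has no local PFN, so it can never be part of our
	 * bounce buffer; bail out before the generic swiotlb range check.
	 */
	if (static_branch_unlikely(&xen_dma_translate) &&
	    !pfn_valid(PFN_DOWN(paddr)))
		return false;

	return is_swiotlb_buffer(paddr);
}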
On Sun, Feb 07, 2021 at 04:56:01PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote:
> > So one thing that has been on my mind for a while: I'd really like
> > to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb
> > to swiotlb the main difference seems to be:
> >
> >  - additional reasons to bounce I/O vs the plain DMA capable
> >  - the possibility to do a hypercall on arm/arm64
> >  - an extra translation layer before doing the phys_to_dma and vice
> >    versa
> >  - a special memory allocator
> >
> > I wonder if in between a few jump labels or other no overhead enablement
> > options and possibly better use of the dma_range_map we could kill
> > off most of swiotlb-xen instead of maintaining all this code duplication?
>
> So I looked at this a bit more.
>
> For x86 with XENFEAT_auto_translated_physmap (how common is that?)

Juergen, Boris please correct me if I am wrong, but that
XENFEAT_auto_translated_physmap only works for PVH guests?

> pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is.
>
> xen_arch_need_swiotlb always returns true for x86, and
> range_straddles_page_boundary should never be true for the
> XENFEAT_auto_translated_physmap case.

Correct. The kernel should have no clue of what the real MFNs are
for PFNs.

> So as far as I can tell the mapping fast path for the
> XENFEAT_auto_translated_physmap can be trivially reused from swiotlb.
>
> That leaves us with the next more complicated case, x86 or fully cache
> coherent arm{,64} without XENFEAT_auto_translated_physmap. In that case
> we need to patch in a phys_to_dma/dma_to_phys that performs the MFN
> lookup, which could be done using alternatives or jump labels.
> I think if that is done right we should also be able to let that cover
> the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but
> in that worst case that would need another alternative / jump label.
>
> For non-coherent arm{,64} we'd also need to use alternatives or jump
> labels for the cache maintenance ops, but that isn't a hard problem
> either.
On 2/19/21 3:32 PM, Konrad Rzeszutek Wilk wrote:
> On Sun, Feb 07, 2021 at 04:56:01PM +0100, Christoph Hellwig wrote:
>> On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote:
>>> So one thing that has been on my mind for a while: I'd really like
>>> to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb
>>> to swiotlb the main difference seems to be:
>>>
>>>  - additional reasons to bounce I/O vs the plain DMA capable
>>>  - the possibility to do a hypercall on arm/arm64
>>>  - an extra translation layer before doing the phys_to_dma and vice
>>>    versa
>>>  - a special memory allocator
>>>
>>> I wonder if in between a few jump labels or other no overhead enablement
>>> options and possibly better use of the dma_range_map we could kill
>>> off most of swiotlb-xen instead of maintaining all this code duplication?
>> So I looked at this a bit more.
>>
>> For x86 with XENFEAT_auto_translated_physmap (how common is that?)
> Juergen, Boris please correct me if I am wrong, but that
> XENFEAT_auto_translated_physmap only works for PVH guests?

That's both HVM and PVH (for dom0 it's only PVH).

-boris

>> pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is.
>>
>> xen_arch_need_swiotlb always returns true for x86, and
>> range_straddles_page_boundary should never be true for the
>> XENFEAT_auto_translated_physmap case.
> Correct. The kernel should have no clue of what the real MFNs are
> for PFNs.
>> So as far as I can tell the mapping fast path for the
>> XENFEAT_auto_translated_physmap can be trivially reused from swiotlb.
>>
>> That leaves us with the next more complicated case, x86 or fully cache
>> coherent arm{,64} without XENFEAT_auto_translated_physmap. In that case
>> we need to patch in a phys_to_dma/dma_to_phys that performs the MFN
>> lookup, which could be done using alternatives or jump labels.
>> I think if that is done right we should also be able to let that cover
>> the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but
>> in that worst case that would need another alternative / jump label.
>>
>> For non-coherent arm{,64} we'd also need to use alternatives or jump
>> labels for the cache maintenance ops, but that isn't a hard problem
>> either.
On Fri, 19 Feb 2021, Konrad Rzeszutek Wilk wrote:
> On Sun, Feb 07, 2021 at 04:56:01PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote:
> > > So one thing that has been on my mind for a while: I'd really like
> > > to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb
> > > to swiotlb the main difference seems to be:
> > >
> > >  - additional reasons to bounce I/O vs the plain DMA capable
> > >  - the possibility to do a hypercall on arm/arm64
> > >  - an extra translation layer before doing the phys_to_dma and vice
> > >    versa
> > >  - a special memory allocator
> > >
> > > I wonder if in between a few jump labels or other no overhead enablement
> > > options and possibly better use of the dma_range_map we could kill
> > > off most of swiotlb-xen instead of maintaining all this code duplication?
> >
> > So I looked at this a bit more.
> >
> > For x86 with XENFEAT_auto_translated_physmap (how common is that?)
>
> Juergen, Boris please correct me if I am wrong, but that
> XENFEAT_auto_translated_physmap only works for PVH guests?

ARM is always XENFEAT_auto_translated_physmap.

> > pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is.
> >
> > xen_arch_need_swiotlb always returns true for x86, and
> > range_straddles_page_boundary should never be true for the
> > XENFEAT_auto_translated_physmap case.
>
> Correct. The kernel should have no clue of what the real MFNs are
> for PFNs.

On ARM, Linux knows the MFNs because for local pages MFN == PFN and for
foreign pages it keeps track of them in arch/arm/xen/p2m.c. More on this
below.

xen_arch_need_swiotlb only returns true on ARM in rare situations where
bouncing on swiotlb buffers is required. Today it only happens on old
versions of Xen that don't support the cache flushing hypercall, but
there could be more cases in the future.

> > So as far as I can tell the mapping fast path for the
> > XENFEAT_auto_translated_physmap can be trivially reused from swiotlb.
> >
> > That leaves us with the next more complicated case, x86 or fully cache
> > coherent arm{,64} without XENFEAT_auto_translated_physmap. In that case
> > we need to patch in a phys_to_dma/dma_to_phys that performs the MFN
> > lookup, which could be done using alternatives or jump labels.
> > I think if that is done right we should also be able to let that cover
> > the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but
> > in that worst case that would need another alternative / jump label.
> >
> > For non-coherent arm{,64} we'd also need to use alternatives or jump
> > labels for the cache maintenance ops, but that isn't a hard problem
> > either.

With the caveat that ARM is always XENFEAT_auto_translated_physmap, what
you wrote looks correct. I am writing down a brief explanation of how
swiotlb-xen is used on ARM.

pfn: address as seen by the guest, pseudo-physical address in ARM
     terminology
mfn (or bfn): real address, physical address in ARM terminology

On ARM, dom0 is auto_translated (so Xen sets up the stage2 translation
in the MMU) and the translation is 1:1. So pfn == mfn for Dom0.

However, when another domain shares a page with Dom0, that page is not
1:1 mapped. Swiotlb-xen is used to retrieve the mfn for the foreign page
at xen_swiotlb_map_page. It does that with xen_phys_to_bus -> pfn_to_bfn,
which is implemented with an rbtree in arch/arm/xen/p2m.c.

In addition, swiotlb-xen is also used to cache-flush the page via
hypercall at xen_swiotlb_unmap_page.
That is done because dev_addr is really the mfn at unmap_page and we
don't know the pfn for it. We can do pfn-to-mfn but we cannot do
mfn-to-pfn (there are good reasons for it, unfortunately). The only way
to cache-flush by mfn is by issuing a hypercall. The hypercall is
implemented in arch/arm/xen/mm.c.

The pfn != bfn and pfn_valid() checks are used to detect if the page is
local (of dom0) or foreign; they work thanks to the fact that Dom0 is
1:1 mapped.

Getting back to what you wrote, yes, if we had a way to do MFN lookups
in phys_to_dma, and a way to call the hypercall at unmap_page if the
page is foreign (e.g. if it fails a pfn_valid check), then I think we
would be good from an ARM perspective. The only exception is when
xen_arch_need_swiotlb returns true, in which case we need to actually
bounce on swiotlb buffers.
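A condensed sketch of the local-vs-foreign decision described above, loosely
modelled on what arch/arm/xen/mm.c does today; the example_* names are invented
for illustration, while pfn_valid(), arch_sync_dma_for_cpu() and the
GNTTABOP_cache_flush hypercall are the real pieces being referred to:

#include <linux/dma-map-ops.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Hypothetical wrapper that issues the GNTTABOP_cache_flush hypercall. */
void example_xen_cache_flush_foreign(dma_addr_t handle, size_t size,
				     enum dma_data_direction dir);

static void example_xen_sync_for_cpu(struct device *dev, dma_addr_t handle,
				     size_t size, enum dma_data_direction dir)
{
	if (pfn_valid(PFN_DOWN(handle))) {
		/*
		 * Local Dom0 page: the 1:1 mapping means the bus address is
		 * also the physical address, so ordinary CPU cache
		 * maintenance works.
		 */
		arch_sync_dma_for_cpu(handle, size, dir);
	} else {
		/*
		 * Foreign page: only the MFN is known and there is no
		 * mfn-to-pfn lookup, so ask Xen to do the cache flush.
		 */
		example_xen_cache_flush_foreign(handle, size, dir);
	}
}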
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 662638093542..e18cae693cdc 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -39,15 +39,17 @@
 #include <asm/xen/page-coherent.h>
 #include <trace/events/swiotlb.h>
-#define MAX_DMA_BITS 32
 /*
  * Used to do a quick range check in swiotlb_tbl_unmap_single and
  * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */

-static char *xen_io_tlb_start, *xen_io_tlb_end;
-static unsigned long xen_io_tlb_nslabs;
+static char *xen_io_tlb_start[SWIOTLB_MAX], *xen_io_tlb_end[SWIOTLB_MAX];
+static unsigned long xen_io_tlb_nslabs[SWIOTLB_MAX];
+
+static int max_dma_bits[] = {32, 64};
+
 /*
  * Quick lookup value of the bus address of the IOTLB.
  */
@@ -112,8 +114,8 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
	 * in our domain. Therefore _only_ check address within our domain.
	 */
	if (pfn_valid(PFN_DOWN(paddr))) {
-		return paddr >= virt_to_phys(xen_io_tlb_start) &&
-		       paddr < virt_to_phys(xen_io_tlb_end);
+		return paddr >= virt_to_phys(xen_io_tlb_start[SWIOTLB_LO]) &&
+		       paddr < virt_to_phys(xen_io_tlb_end[SWIOTLB_LO]);
	}
	return 0;
 }
@@ -137,7 +139,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
				p + (i << IO_TLB_SHIFT),
				get_order(slabs << IO_TLB_SHIFT),
				dma_bits, &dma_handle);
-		} while (rc && dma_bits++ < MAX_DMA_BITS);
+		} while (rc && dma_bits++ < max_dma_bits[SWIOTLB_LO]);
		if (rc)
			return rc;

@@ -148,12 +150,13 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
 static unsigned long xen_set_nslabs(unsigned long nr_tbl)
 {
	if (!nr_tbl) {
-		xen_io_tlb_nslabs = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
-		xen_io_tlb_nslabs = ALIGN(xen_io_tlb_nslabs, IO_TLB_SEGSIZE);
+		xen_io_tlb_nslabs[SWIOTLB_LO] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
+		xen_io_tlb_nslabs[SWIOTLB_LO] = ALIGN(xen_io_tlb_nslabs[SWIOTLB_LO],
+						      IO_TLB_SEGSIZE);
	} else
-		xen_io_tlb_nslabs = nr_tbl;
+		xen_io_tlb_nslabs[SWIOTLB_LO] = nr_tbl;

-	return xen_io_tlb_nslabs << IO_TLB_SHIFT;
+	return xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 }

 enum xen_swiotlb_err {
@@ -184,16 +187,16 @@ int __ref xen_swiotlb_init(int verbose, bool early)
	enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
	unsigned int repeat = 3;

-	xen_io_tlb_nslabs = swiotlb_nr_tbl(SWIOTLB_LO);
+	xen_io_tlb_nslabs[SWIOTLB_LO] = swiotlb_nr_tbl(SWIOTLB_LO);
 retry:
-	bytes = xen_set_nslabs(xen_io_tlb_nslabs);
-	order = get_order(xen_io_tlb_nslabs << IO_TLB_SHIFT);
+	bytes = xen_set_nslabs(xen_io_tlb_nslabs[SWIOTLB_LO]);
+	order = get_order(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);

	/*
	 * IO TLB memory already allocated. Just use it.
	 */
	if (io_tlb_start[SWIOTLB_LO] != 0) {
-		xen_io_tlb_start = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
+		xen_io_tlb_start[SWIOTLB_LO] = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
		goto end;
	}

@@ -201,76 +204,78 @@ int __ref xen_swiotlb_init(int verbose, bool early)
	 * Get IO TLB memory from any location.
	 */
	if (early) {
-		xen_io_tlb_start = memblock_alloc(PAGE_ALIGN(bytes),
+		xen_io_tlb_start[SWIOTLB_LO] = memblock_alloc(PAGE_ALIGN(bytes),
						  PAGE_SIZE);
-		if (!xen_io_tlb_start)
+		if (!xen_io_tlb_start[SWIOTLB_LO])
			panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
			      __func__, PAGE_ALIGN(bytes), PAGE_SIZE);
	} else {
 #define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
 #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
		while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-			xen_io_tlb_start = (void *)xen_get_swiotlb_free_pages(order);
-			if (xen_io_tlb_start)
+			xen_io_tlb_start[SWIOTLB_LO] = (void *)xen_get_swiotlb_free_pages(order);
+			if (xen_io_tlb_start[SWIOTLB_LO])
				break;
			order--;
		}
		if (order != get_order(bytes)) {
			pr_warn("Warning: only able to allocate %ld MB for software IO TLB\n",
				(PAGE_SIZE << order) >> 20);
-			xen_io_tlb_nslabs = SLABS_PER_PAGE << order;
-			bytes = xen_io_tlb_nslabs << IO_TLB_SHIFT;
+			xen_io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
+			bytes = xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
		}
	}
-	if (!xen_io_tlb_start) {
+	if (!xen_io_tlb_start[SWIOTLB_LO]) {
		m_ret = XEN_SWIOTLB_ENOMEM;
		goto error;
	}

	/*
	 * And replace that memory with pages under 4GB.
	 */
-	rc = xen_swiotlb_fixup(xen_io_tlb_start,
+	rc = xen_swiotlb_fixup(xen_io_tlb_start[SWIOTLB_LO],
			       bytes,
-			       xen_io_tlb_nslabs);
+			       xen_io_tlb_nslabs[SWIOTLB_LO]);
	if (rc) {
		if (early)
-			memblock_free(__pa(xen_io_tlb_start),
+			memblock_free(__pa(xen_io_tlb_start[SWIOTLB_LO]),
				      PAGE_ALIGN(bytes));
		else {
-			free_pages((unsigned long)xen_io_tlb_start, order);
-			xen_io_tlb_start = NULL;
+			free_pages((unsigned long)xen_io_tlb_start[SWIOTLB_LO], order);
+			xen_io_tlb_start[SWIOTLB_LO] = NULL;
		}
		m_ret = XEN_SWIOTLB_EFIXUP;
		goto error;
	}
	if (early) {
-		if (swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs,
+		if (swiotlb_init_with_tbl(xen_io_tlb_start[SWIOTLB_LO],
+					  xen_io_tlb_nslabs[SWIOTLB_LO],
					  SWIOTLB_LO, verbose))
			panic("Cannot allocate SWIOTLB buffer");
		rc = 0;
	} else
-		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start,
-						xen_io_tlb_nslabs, SWIOTLB_LO);
+		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start[SWIOTLB_LO],
+						xen_io_tlb_nslabs[SWIOTLB_LO],
+						SWIOTLB_LO);

 end:
-	xen_io_tlb_end = xen_io_tlb_start + bytes;
+	xen_io_tlb_end[SWIOTLB_LO] = xen_io_tlb_start[SWIOTLB_LO] + bytes;
	if (!rc)
		swiotlb_set_max_segment(PAGE_SIZE, SWIOTLB_LO);

	return rc;
 error:
	if (repeat--) {
-		xen_io_tlb_nslabs = max(1024UL, /* Min is 2MB */
-					(xen_io_tlb_nslabs >> 1));
+		xen_io_tlb_nslabs[SWIOTLB_LO] = max(1024UL, /* Min is 2MB */
+						    (xen_io_tlb_nslabs[SWIOTLB_LO] >> 1));
		pr_info("Lowering to %luMB\n",
-			(xen_io_tlb_nslabs << IO_TLB_SHIFT) >> 20);
+			(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT) >> 20);
		goto retry;
	}
	pr_err("%s (rc:%d)\n", xen_swiotlb_error(m_ret), rc);
	if (early)
		panic("%s (rc:%d)", xen_swiotlb_error(m_ret), rc);
	else
-		free_pages((unsigned long)xen_io_tlb_start, order);
+		free_pages((unsigned long)xen_io_tlb_start[SWIOTLB_LO], order);
	return rc;
 }

@@ -561,7 +566,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-	return xen_virt_to_bus(hwdev, xen_io_tlb_end - 1) <= mask;
+	return xen_virt_to_bus(hwdev, xen_io_tlb_end[SWIOTLB_LO] - 1) <= mask;
 }

 const struct dma_map_ops xen_swiotlb_dma_ops = {
This patch converts several xen-swiotlb related variables to arrays, in
order to maintain the state for each of the different swiotlb buffers.
The variables involved are:

 - xen_io_tlb_start and xen_io_tlb_end
 - xen_io_tlb_nslabs
 - MAX_DMA_BITS

There is no functional change; this only prepares for enabling 64-bit
xen-swiotlb.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 drivers/xen/swiotlb-xen.c | 75 +++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 35 deletions(-)
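For context when reading the diff above: the SWIOTLB_LO/SWIOTLB_MAX indices come
from earlier patches in this series. The indexing scheme is presumably along the
lines of the enum below; this is a reconstruction for readability, not a quote
from those patches:

/* Assumed shape of the buffer indices introduced earlier in the series. */
enum swiotlb_t {
	SWIOTLB_LO,	/* the traditional 32-bit addressable bounce buffer */
	SWIOTLB_HI,	/* the new 64-bit capable bounce buffer */
	SWIOTLB_MAX,
};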