[v1: vfio: avoid unnecessary pin memory when dma map io address space 0/2]


Message ID

cover.1729760996.git.qinyuntan@linux.alibaba.com (mailing list archive)

Headers

From: Qinyun Tan <qinyuntan@linux.alibaba.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Alex Williamson <alex.williamson@redhat.com>
Cc: linux-mm@kvack.org,
	kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Qinyun Tan <qinyuntan@linux.alibaba.com>
Subject: [PATCH v1: vfio: avoid unnecessary pin memory when dma map io address
 space 0/2] 
Date: Thu, 24 Oct 2024 17:34:42 +0800
Message-ID: <cover.1729760996.git.qinyuntan@linux.alibaba.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

[v1: vfio: avoid unnecessary pin memory when dma map io address space 1/2] mm: introduce vma flag VM_PGOFF_IS_PFN

Message

qinyuntan Oct. 24, 2024, 9:34 a.m. UTC

When a user application calls ioctl(VFIO_IOMMU_MAP_DMA) to map a DMA address,
the general handler 'vfio_pin_map_dma' attempts to pin the memory and
then creates the mapping in the IOMMU.
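
For readers unfamiliar with the userspace side, the call described above can
be sketched roughly as follows. This is a minimal illustration, not code from
the series: the helper names make_dma_map and map_dma are hypothetical, and
container_fd is assumed to be a VFIO container that already has a type1 IOMMU
set. Only the struct, flags and ioctl number come from <linux/vfio.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: build a read/write map request for
 * [vaddr, vaddr + size) at the given IOVA. */
static struct vfio_iommu_type1_dma_map
make_dma_map(void *vaddr, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map;

	assert(size != 0);	/* a zero-length mapping is never valid */
	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uint64_t)(uintptr_t)vaddr;
	map.iova = iova;
	map.size = size;
	return map;
}

/* Hypothetical helper: issue the ioctl on an already configured container.
 * Returns 0 on success, -1 with errno set on failure. */
static int map_dma(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map = make_dma_map(vaddr, iova, size);

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

In the scenario discussed in this thread, vaddr would point into an mmap'd
device BAR rather than ordinary anonymous memory.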

However, some mappings aren't backed by a struct page, for example an
mmap'd MMIO range for our own or another device. In this scenario, where
the vma has the flags VM_IO | VM_PFNMAP, the pin operation will fail.
Moreover, the pin operation incurs a large overhead, which results in a
longer startup time for the VM. We don't actually need a pin in this scenario.

To address this issue, we introduce a new DMA MAP flag
'VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN' to skip the 'vfio_pin_pages_remote'
operation in the DMA map process for MMIO memory. Additionally, we add
the 'VM_PGOFF_IS_PFN' flag for addresses mapped by vfio_pci_mmap, ensuring
that we can directly obtain the pfn through vma->vm_pgoff.
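
The lookup that such a flag would permit can be modelled in userspace as
below. This is only a sketch under the assumption that the mapping is linear
and that vm_pgoff holds the pfn backing vm_start; struct model_vma and
model_vaddr_to_pfn are hypothetical stand-ins, not kernel code and not taken
from the patches.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the few vm_area_struct fields used here. */
struct model_vma {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_end;	/* one past the last virtual address */
	uint64_t vm_pgoff;	/* assumed to hold the pfn backing vm_start */
};

/* Return the pfn backing vaddr, assuming a linear mapping whose base pfn
 * is stored in vm_pgoff. No page table walk or pin is involved. */
static uint64_t model_vaddr_to_pfn(const struct model_vma *vma, uint64_t vaddr)
{
	assert(vaddr >= vma->vm_start && vaddr < vma->vm_end);
	return vma->vm_pgoff + ((vaddr - vma->vm_start) >> MODEL_PAGE_SHIFT);
}
```

The point of the model is that the translation becomes plain arithmetic on
the vma, which is why no per-page work would be needed.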

This approach allows us to avoid unnecessary memory pinning operations,
which would otherwise introduce additional overhead during DMA mapping.

In my tests, using vfio to pass through an 8-card AMD GPU setup with a
large BAR size (128GB*8), the time to map the 192GB*8 BAR was reduced
from about 50.79s to 1.57s.

Qinyun Tan (2):
  mm: introduce vma flag VM_PGOFF_IS_PFN
  vfio: avoid unnecessary pin memory when dma map io address space

 drivers/vfio/pci/vfio_pci_core.c |  2 +-
 drivers/vfio/vfio_iommu_type1.c  | 64 +++++++++++++++++++++++++-------
 include/linux/mm.h               |  6 +++
 include/uapi/linux/vfio.h        | 11 ++++++
 4 files changed, 68 insertions(+), 15 deletions(-)

Comments

Alex Williamson Oct. 24, 2024, 5:06 p.m. UTC | #1

On Thu, 24 Oct 2024 17:34:42 +0800
Qinyun Tan <qinyuntan@linux.alibaba.com> wrote:

> When user application call ioctl(VFIO_IOMMU_MAP_DMA) to map a dma address,
> the general handler 'vfio_pin_map_dma' attempts to pin the memory and
> then create the mapping in the iommu.
> 
> However, some mappings aren't backed by a struct page, for example an
> mmap'd MMIO range for our own or another device. In this scenario, a vma
> with flag VM_IO | VM_PFNMAP, the pin operation will fail. Moreover, the
> pin operation incurs a large overhead which will result in a longer
> startup time for the VM. We don't actually need a pin in this scenario.
> 
> To address this issue, we introduce a new DMA MAP flag
> 'VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN' to skip the 'vfio_pin_pages_remote'
> operation in the DMA map process for mmio memory. Additionally, we add
> the 'VM_PGOFF_IS_PFN' flag for vfio_pci_mmap address, ensuring that we can
> directly obtain the pfn through vma->vm_pgoff.
> 
> This approach allows us to avoid unnecessary memory pinning operations,
> which would otherwise introduce additional overhead during DMA mapping.
> 
> In my tests, using vfio to pass through an 8-card AMD GPU which with a
> large bar size (128GB*8), the time mapping the 192GB*8 bar was reduced
> from about 50.79s to 1.57s.

If the vma has a flag to indicate pfnmap, why does the user need to
provide a mapping flag to indicate not to pin?  We generally cannot
trust such a user directive anyway, nor do we in this series, so it all
seems rather redundant.

What about simply improving the batching of pfnmap ranges rather than
imposing any sort of mm or uapi changes?  Or perhaps, since we're now
using huge_fault to populate the vma, maybe we can iterate at PMD or
PUD granularity rather than PAGE_SIZE?  Seems like we have plenty of
optimizations to pursue that could be done transparently to the user.
Thanks,

Alex
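
To put the granularity suggestion above in numbers, the following sketch
counts how many loop iterations a walk over one BAR needs at each step size.
The sizes are assumptions for x86-64 (4 KiB pages, 2 MiB PMD, 1 GiB PUD), the
128 GiB BAR in the usage note is just one of the figures quoted in the cover
letter, and model_walk_steps is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 sizes: 4 KiB base pages, 2 MiB PMD and 1 GiB PUD mappings. */
#define MODEL_PAGE_SIZE (UINT64_C(1) << 12)
#define MODEL_PMD_SIZE  (UINT64_C(1) << 21)
#define MODEL_PUD_SIZE  (UINT64_C(1) << 30)

/* Number of loop iterations needed to walk len bytes in steps of step bytes. */
static uint64_t model_walk_steps(uint64_t len, uint64_t step)
{
	assert(step != 0);
	return (len + step - 1) / step;	/* round up for a partial last step */
}
```

For a single 128 GiB BAR this gives 33554432 iterations at PAGE_SIZE, 65536
at PMD size and 128 at PUD size, assuming the range is suitably aligned.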

Jason Gunthorpe Oct. 24, 2024, 6:19 p.m. UTC | #2

On Thu, Oct 24, 2024 at 11:06:24AM -0600, Alex Williamson wrote:
> On Thu, 24 Oct 2024 17:34:42 +0800
> Qinyun Tan <qinyuntan@linux.alibaba.com> wrote:
> 
> > When user application call ioctl(VFIO_IOMMU_MAP_DMA) to map a dma address,
> > the general handler 'vfio_pin_map_dma' attempts to pin the memory and
> > then create the mapping in the iommu.
> > 
> > However, some mappings aren't backed by a struct page, for example an
> > mmap'd MMIO range for our own or another device. In this scenario, a vma
> > with flag VM_IO | VM_PFNMAP, the pin operation will fail. Moreover, the
> > pin operation incurs a large overhead which will result in a longer
> > startup time for the VM. We don't actually need a pin in this scenario.
> > 
> > To address this issue, we introduce a new DMA MAP flag
> > 'VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN' to skip the 'vfio_pin_pages_remote'
> > operation in the DMA map process for mmio memory. Additionally, we add
> > the 'VM_PGOFF_IS_PFN' flag for vfio_pci_mmap address, ensuring that we can
> > directly obtain the pfn through vma->vm_pgoff.
> > 
> > This approach allows us to avoid unnecessary memory pinning operations,
> > which would otherwise introduce additional overhead during DMA mapping.
> > 
> > In my tests, using vfio to pass through an 8-card AMD GPU which with a
> > large bar size (128GB*8), the time mapping the 192GB*8 bar was reduced
> > from about 50.79s to 1.57s.
> 
> If the vma has a flag to indicate pfnmap, why does the user need to
> provide a mapping flag to indicate not to pin?  We generally cannot
> trust such a user directive anyway, nor do we in this series, so it all
> seems rather redundant.

The best answer is to map from DMABUF not from VMA and then you get
perfect aggregation cheaply.
 
> What about simply improving the batching of pfnmap ranges rather than
> imposing any sort of mm or uapi changes?  Or perhaps, since we're now
> using huge_fault to populate the vma, maybe we can iterate at PMD or
> PUD granularity rather than PAGE_SIZE?  Seems like we have plenty of
> optimizations to pursue that could be done transparently to the
> user.

I don't want to add more stuff to support the security-broken
follow_pfn path. It needs to be replaced.

Leon's work to improve the DMA API is so close, so we may be close to
the end!

There are two versions of the dmabuf patches on the list; it would be
good to get that in good shape. We could make a full solution,
including the vfio/iommufd map side while waiting.

Jason

qinyuntan Oct. 29, 2024, 2:50 a.m. UTC | #3

> On Oct 25, 2024, at 01:06, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Thu, 24 Oct 2024 17:34:42 +0800
> Qinyun Tan <qinyuntan@linux.alibaba.com> wrote:
> 
>> When user application call ioctl(VFIO_IOMMU_MAP_DMA) to map a dma address,
>> the general handler 'vfio_pin_map_dma' attempts to pin the memory and
>> then create the mapping in the iommu.
>> 
>> However, some mappings aren't backed by a struct page, for example an
>> mmap'd MMIO range for our own or another device. In this scenario, a vma
>> with flag VM_IO | VM_PFNMAP, the pin operation will fail. Moreover, the
>> pin operation incurs a large overhead which will result in a longer
>> startup time for the VM. We don't actually need a pin in this scenario.
>> 
>> To address this issue, we introduce a new DMA MAP flag
>> 'VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN' to skip the 'vfio_pin_pages_remote'
>> operation in the DMA map process for mmio memory. Additionally, we add
>> the 'VM_PGOFF_IS_PFN' flag for vfio_pci_mmap address, ensuring that we can
>> directly obtain the pfn through vma->vm_pgoff.
>> 
>> This approach allows us to avoid unnecessary memory pinning operations,
>> which would otherwise introduce additional overhead during DMA mapping.
>> 
>> In my tests, using vfio to pass through an 8-card AMD GPU which with a
>> large bar size (128GB*8), the time mapping the 192GB*8 bar was reduced
>> from about 50.79s to 1.57s.
> 
> If the vma has a flag to indicate pfnmap, why does the user need to
> provide a mapping flag to indicate not to pin?  We generally cannot
> trust such a user directive anyway, nor do we in this series, so it all
> seems rather redundant.
> 
> What about simply improving the batching of pfnmap ranges rather than
> imposing any sort of mm or uapi changes?  Or perhaps, since we're now
> using huge_fault to populate the vma, maybe we can iterate at PMD or
> PUD granularity rather than PAGE_SIZE?  Seems like we have plenty of
> optimizations to pursue that could be done transparently to the user.
> Thanks,
> 
> Alex

qinyuntan Oct. 29, 2024, 3:32 a.m. UTC | #4

You are right; it seems I had missed the relevant updates. Commit
f9e54c3a2f5b7 ("vfio/pci: implement huge_fault support") introduced
huge_fault, and maybe we can achieve the same effect by adjusting the
order parameter of vfio_pci_mmap_huge_fault.
Thanks,

Qinyun Tan
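
As a rough illustration of what choosing an order involves, the sketch below
picks the largest mapping order whose alignment and extent fit a given
address and pfn. It is a userspace model only: the order values assume x86-64
with 4 KiB pages, and model_order_fits and model_best_order are hypothetical
names, not the logic of vfio_pci_mmap_huge_fault itself.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MODEL_PMD_ORDER   9	/* assumption: 2 MiB = 4 KiB << 9 */
#define MODEL_PUD_ORDER  18	/* assumption: 1 GiB = 4 KiB << 18 */

/* True if a mapping of the given order can be placed at vaddr/pfn
 * without leaving [start, end). */
static int model_order_fits(uint64_t vaddr, uint64_t pfn,
			    uint64_t start, uint64_t end, unsigned int order)
{
	uint64_t size = UINT64_C(1) << (MODEL_PAGE_SHIFT + order);

	if (vaddr & (size - 1))			/* virtual address not aligned */
		return 0;
	if (pfn & ((UINT64_C(1) << order) - 1))	/* physical frame not aligned */
		return 0;
	return vaddr >= start && vaddr + size <= end;
}

/* Pick the largest order among PUD, PMD and a single page that fits. */
static unsigned int model_best_order(uint64_t vaddr, uint64_t pfn,
				     uint64_t start, uint64_t end)
{
	static const unsigned int orders[] = {
		MODEL_PUD_ORDER, MODEL_PMD_ORDER, 0
	};

	for (unsigned int i = 0; i < 3; i++)
		if (model_order_fits(vaddr, pfn, start, end, orders[i]))
			return orders[i];
	assert(0 && "vaddr must be page aligned and inside [start, end)");
	return 0;
}
```

A larger order is usable only when both the virtual address and the pfn are
aligned to it, so a misaligned BAR mapping falls back to a smaller step.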
