Message ID: 4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com
State: Not Applicable
Series: Allow MMIO regions to be exported through dma-buf
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +	int rc;
> +
> +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> +				      true);

This should just use pci_p2pdma_distance.

> +	/*
> +	 * Since the memory being mapped is a device memory it could never be in
> +	 * CPU caches.
> +	 */

DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
where this wisdom comes from.

> +	dma_addr = dma_map_resource(
> +		attachment->dev,
> +		pci_resource_start(priv->vdev->pdev, priv->index) +
> +			priv->offset,
> +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

This is not how P2P addresses are mapped. You need to use
dma_map_sgtable and have the proper pgmap for it.

The above is just a badly implemented version of the dma-direct
PCI_P2PDMA_MAP_BUS_ADDR case, ignoring mappings through the host
bridge or dma-map-ops interactions.
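For reference, the shape of the path the review is asking for looks roughly like the sketch below. This is not the patch under review; it assumes the BAR range has been registered with the pci_p2pdma machinery so a pgmap exists for it, and `sgt` stands in for a scatter table populated from those pgmap-backed pages:

```c
/* Sketch only, assuming the BAR was registered via pci_p2pdma_add_resource()
 * so its pages carry a P2PDMA pgmap.  'sgt' is assumed to already hold
 * those pages; error handling trimmed. */
rc = pci_p2pdma_distance(priv->vdev->pdev, attachment->dev, true);
if (rc < 0)
	return ERR_PTR(-EINVAL);

/* dma_map_sgtable() sees the pgmap and picks the bus-address or
 * host-bridge mapping per segment; no DMA_ATTR_SKIP_CPU_SYNC needed. */
rc = dma_map_sgtable(attachment->dev, sgt, dir, 0);
```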
On 06.09.22 at 11:51, Christoph Hellwig wrote:
>> +{
>> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
>> +	int rc;
>> +
>> +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
>> +				      true);
> This should just use pci_p2pdma_distance.
>
>> +	/*
>> +	 * Since the memory being mapped is a device memory it could never be in
>> +	 * CPU caches.
>> +	 */
> DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> where this wisdom comes from.
>
>> +	dma_addr = dma_map_resource(
>> +		attachment->dev,
>> +		pci_resource_start(priv->vdev->pdev, priv->index) +
>> +			priv->offset,
>> +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> This is not how P2P addresses are mapped. You need to use
> dma_map_sgtable and have the proper pgmap for it.

The problem is once more that this is MMIO space, in other words register
BARs which need to be exported/imported.

Adding struct pages for it generally sounds like the wrong approach here.
You can't even access this with the CPU or would trigger potentially
unwanted hardware actions.

Would you mind if I start to tackle this problem?

Regards,
Christian.

> The above is just a badly implemented version of the dma-direct
> PCI_P2PDMA_MAP_BUS_ADDR case, ignoring mappings through the host
> bridge or dma-map-ops interactions.
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> On 06.09.22 at 11:51, Christoph Hellwig wrote:
> > > +{
> > > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > +	int rc;
> > > +
> > > +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> > > +				      true);
> > This should just use pci_p2pdma_distance.

OK

> > > +	/*
> > > +	 * Since the memory being mapped is a device memory it could never be in
> > > +	 * CPU caches.
> > > +	 */
> > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > where this wisdom comes from.

Habana driver

> > > +	dma_addr = dma_map_resource(
> > > +		attachment->dev,
> > > +		pci_resource_start(priv->vdev->pdev, priv->index) +
> > > +			priv->offset,
> > > +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> > This is not how P2P addresses are mapped. You need to use
> > dma_map_sgtable and have the proper pgmap for it.
>
> The problem is once more that this is MMIO space, in other words register
> BARs which need to be exported/imported.
>
> Adding struct pages for it generally sounds like the wrong approach here.
> You can't even access this with the CPU or would trigger potentially
> unwanted hardware actions.

Right, this whole thing is the "standard" that dmabuf has adopted
instead of the struct pages. Once the AMD GPU driver started doing
this some time ago other drivers followed.

Now we have struct pages, almost, but I'm not sure if their limits are
compatible with VFIO? This has to work for small BARs as well.

Jason
On Tue, Sep 6, 2022 at 2:48 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> > On 06.09.22 at 11:51, Christoph Hellwig wrote:
> > > > +{
> > > > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > > +	int rc;
> > > > +
> > > > +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> > > > +				      true);
> > > This should just use pci_p2pdma_distance.
>
> OK
>
> > > > +	/*
> > > > +	 * Since the memory being mapped is a device memory it could never be in
> > > > +	 * CPU caches.
> > > > +	 */
> > > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > > where this wisdom comes from.
>
> Habana driver

I hate to throw the ball at someone else, but I actually copied the
code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc.
And if you remember Jason, you asked why we use this specific define
in the original review you did and I replied the following (to which
you agreed and that's why we added the comment):

"The memory behind this specific dma-buf has *always* resided on the
device itself, i.e. it lives only in the 'device' domain (after all, it
maps a PCI bar address which points to the device memory). Therefore,
it was never in the 'CPU' domain and hence, there is no need to perform
a sync of the memory to the CPU's cache, as it was never inside that
cache to begin with. This is not the same case as with regular memory
which is dma-mapped and then copied into the device using a dma engine.
In that case, the memory started in the 'CPU' domain and moved to the
'device' domain. When it is unmapped it will indeed be recycled to be
used for another purpose and therefore we need to sync the CPU cache."

Oded

> > > > +	dma_addr = dma_map_resource(
> > > > +		attachment->dev,
> > > > +		pci_resource_start(priv->vdev->pdev, priv->index) +
> > > > +			priv->offset,
> > > > +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > This is not how P2P addresses are mapped. You need to use
> > > dma_map_sgtable and have the proper pgmap for it.
> >
> > The problem is once more that this is MMIO space, in other words register
> > BARs which need to be exported/imported.
> >
> > Adding struct pages for it generally sounds like the wrong approach here.
> > You can't even access this with the CPU or would trigger potentially
> > unwanted hardware actions.
>
> Right, this whole thing is the "standard" that dmabuf has adopted
> instead of the struct pages. Once the AMD GPU driver started doing
> this some time ago other drivers followed.
>
> Now we have struct pages, almost, but I'm not sure if their limits are
> compatible with VFIO? This has to work for small BARs as well.
>
> Jason
On Tue, Sep 06, 2022 at 03:34:02PM +0300, Oded Gabbay wrote:
> > > > > +	/*
> > > > > +	 * Since the memory being mapped is a device memory it could never be in
> > > > > +	 * CPU caches.
> > > > > +	 */
> > > > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > > > where this wisdom comes from.
> >
> > Habana driver
>
> I hate to throw the ball at someone else, but I actually copied the
> code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc.
> And if you remember Jason, you asked why we use this specific define
> in the original review you did and I replied the following (to which
> you agreed and that's why we added the comment):

Yes, I remember, but Christoph's remark is that DMA_ATTR_SKIP_CPU_SYNC
doesn't even do anything when passed to dma_map_resource().

The only attr that seems to be used is DMA_ATTR_PRIVILEGED from what I
can see.

Jason
On Tue, Sep 6, 2022 at 8:59 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 06, 2022 at 03:34:02PM +0300, Oded Gabbay wrote:
> > > > > > +	/*
> > > > > > +	 * Since the memory being mapped is a device memory it could never be in
> > > > > > +	 * CPU caches.
> > > > > > +	 */
> > > > > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > > > > where this wisdom comes from.
> > >
> > > Habana driver
> > I hate to throw the ball at someone else, but I actually copied the
> > code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc.
> > And if you remember Jason, you asked why we use this specific define
> > in the original review you did and I replied the following (to which
> > you agreed and that's why we added the comment):
>
> Yes, I remember, but Christoph's remark is that DMA_ATTR_SKIP_CPU_SYNC
> doesn't even do anything when passed to dma_map_resource().
>
> The only attr that seems to be used is DMA_ATTR_PRIVILEGED from what I
> can see.
>
> Jason

Yes, it appears he is correct.

Probably what happened is that either this was originally copied from a
use of dma_map_page or something similar that does check this
attribute, or maybe dma_map_resource used it in the past and the
underlying code has changed.

Regardless, it seems we can remove it from the calls to
dma_map_resource. I went over the kernel code and it seems only habana
and amdgpu (and amdkfd) are passing this property to dma_map_resource.
All other callers just pass 0.

Oded
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> The problem is once more that this is MMIO space, in other words register
> BARs which need to be exported/imported.

Everything used for P2P is BAR space.

> Adding struct pages for it generally sounds like the wrong approach here.
> You can't even access this with the CPU or would trigger potentially
> unwanted hardware actions.

How would an access from the CPU vs another device make any difference?

> Would you mind if I start to tackle this problem?

Yes, I've been waiting forever for someone to tackle how the scatterlist
is abused in dma-buf..
On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
> Right, this whole thing is the "standard" that dmabuf has adopted
> instead of the struct pages. Once the AMD GPU driver started doing
> this some time ago other drivers followed.

But it is simply wrong. The scatterlist requires struct page backing.
In theory a physical address would be enough, but when Dan Williams
sent patches for that Linus shot them down.

That being said, the scatterlist is the wrong interface here (and
probably for most of its uses). We really want a low-level struct
with just the dma_address and length for the DMA side, and leave it
separate from what is used to generate it (in most cases that would
be a bio_vec).

> Now we have struct pages, almost, but I'm not sure if their limits are
> compatible with VFIO? This has to work for small bars as well.

Why would small BARs be problematic for the pages? The pages are more
a problem for gigantic BARs due to the memory overhead.
On Tue, Sep 06, 2022 at 10:44:45PM +0300, Oded Gabbay wrote:
> Regardless, it seems we can remove it from the calls to
> dma_map_resource. I went over the kernel code and it seems only habana
> and amdgpu (and amdkfd) are passing this property to dma_map_resource.
> All other callers just pass 0.

What really needs to happen is a revert of that entire commit, given
that dma_map_resource must not be used for PCIe P2P transactions as
explained in my previous mail.
On Wed, Sep 07, 2022 at 05:05:57AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
> > Right, this whole thing is the "standard" that dmabuf has adopted
> > instead of the struct pages. Once the AMD GPU driver started doing
> > this some time ago other drivers followed.
>
> But it is simply wrong. The scatterlist requires struct page backing.
> In theory a physical address would be enough, but when Dan Williams
> sent patches for that Linus shot them down.

Yes, you said that, and I said that when the AMD driver first merged
it - but it went in anyhow and now people are using it in a bunch of
places.

I'm happy that Christian wants to start trying to fix it, and will
help him, but it doesn't really impact this. Whatever fix is cooked up
will apply equally to vfio and habana.

> That being said, the scatterlist is the wrong interface here (and
> probably for most of its uses). We really want a low-level struct
> with just the dma_address and length for the DMA side, and leave it
> separate from what is used to generate it (in most cases that would
> be a bio_vec).

Oh definitely

> > Now we have struct pages, almost, but I'm not sure if their limits are
> > compatible with VFIO? This has to work for small bars as well.
>
> Why would small BARs be problematic for the pages? The pages are more
> a problem for gigantic BARs due to the memory overhead.

How do I get a struct page * for a 4k BAR in vfio?

The docs say:

    ..hotplug api on memory block boundaries. The implementation relies on
    this lack of user-api constraint to allow sub-section sized memory
    ranges to be specified to :c:func:`arch_add_memory`, the top-half of
    memory hotplug. Sub-section support allows for 2MB as the cross-arch
    common alignment granularity for :c:func:`devm_memremap_pages`.

Jason
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> Yes, you said that, and I said that when the AMD driver first merged
> it - but it went in anyhow and now people are using it in a bunch of
> places.

drm folks made up their own weird rules; if they internally stick
to it they have to listen to it given that they ignore review comments,
but it violates the scatterlist API and has no business anywhere
else in the kernel. And yes, there probably is a reason or two why
the drm code is unusually error prone.

> > Why would small BARs be problematic for the pages? The pages are more
> > a problem for gigantic BARs due to the memory overhead.
>
> How do I get a struct page * for a 4k BAR in vfio?

I guess we have different definitions of small then :)

But unless my understanding of the code is out of date,
memremap_pages just requires the (virtual) start address to be 2MB
aligned, not the size. Adding Dan for comments.

That being said, what is the point of mapping say a 4k BAR for p2p?
You're not going to save a measurable amount of CPU overhead if that
is the only place you transfer to.
On Wed, Sep 7, 2022 at 5:30 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
>
> drm folks made up their own weird rules; if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has no business anywhere
> else in the kernel. And yes, there probably is a reason or two why
> the drm code is unusually error prone.
>
> > > Why would small BARs be problematic for the pages? The pages are more
> > > a problem for gigantic BARs due to the memory overhead.
> >
> > How do I get a struct page * for a 4k BAR in vfio?
>
> I guess we have different definitions of small then :)
>
> But unless my understanding of the code is out of date,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size. Adding Dan for comments.
>
> That being said, what is the point of mapping say a 4k BAR for p2p?
> You're not going to save a measurable amount of CPU overhead if that
> is the only place you transfer to.

I don't know what Jason had in mind, but I can see a use for that for
writing to doorbells of a device.

Today, usually what happens is that peer A reads/writes to peer B's
memory through the large bar and then signals the host the operation
was completed. Then the host s/w writes to the doorbell of peer B to
let it know it can continue with the execution as the data is now
ready (or can be recycled).

I can imagine peer A writing directly to the doorbell of peer B, and
usually for that we would like to expose a very small area, probably a
single 4K page.

Oded
On 2022-09-07 13:33, Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 05:05:57AM -0700, Christoph Hellwig wrote:
>> On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
>>> Right, this whole thing is the "standard" that dmabuf has adopted
>>> instead of the struct pages. Once the AMD GPU driver started doing
>>> this some time ago other drivers followed.
>>
>> But it is simply wrong. The scatterlist requires struct page backing.
>> In theory a physical address would be enough, but when Dan Williams
>> sent patches for that Linus shot them down.
>
> Yes, you said that, and I said that when the AMD driver first merged
> it - but it went in anyhow and now people are using it in a bunch of
> places.
>
> I'm happy that Christian wants to start trying to fix it, and will
> help him, but it doesn't really impact this. Whatever fix is cooked up
> will apply equally to vfio and habana.

We've just added support for P2P segments in scatterlists, can that not
be used here?

Robin.

>> That being said, the scatterlist is the wrong interface here (and
>> probably for most of its uses). We really want a low-level struct
>> with just the dma_address and length for the DMA side, and leave it
>> separate from what is used to generate it (in most cases that would
>> be a bio_vec).
>
> Oh definitely
>
>>> Now we have struct pages, almost, but I'm not sure if their limits are
>>> compatible with VFIO? This has to work for small bars as well.
>>
>> Why would small BARs be problematic for the pages? The pages are more
>> a problem for gigantic BARs due to the memory overhead.
>
> How do I get a struct page * for a 4k BAR in vfio?
>
> The docs say:
>
>     ..hotplug api on memory block boundaries. The implementation relies on
>     this lack of user-api constraint to allow sub-section sized memory
>     ranges to be specified to :c:func:`arch_add_memory`, the top-half of
>     memory hotplug. Sub-section support allows for 2MB as the cross-arch
>     common alignment granularity for :c:func:`devm_memremap_pages`.
>
> Jason
On 07.09.22 at 14:03, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
>> The problem is once more that this is MMIO space, in other words register
>> BARs which need to be exported/imported.
> Everything used for P2P is BAR space.
>
>> Adding struct pages for it generally sounds like the wrong approach here.
>> You can't even access this with the CPU or would trigger potentially
>> unwanted hardware actions.
> How would an access from the CPU vs another device make any difference?

The key point is that you can't do any CPU fallback with this as long as
the CPU wouldn't do exactly the same thing as the original hardware
device. E.g. not write combine nor do any full page copies etc...

See, what happens here is not really a P2P DMA transfer, but rather P2P
signaling of events. For a simple example think of a camera and a video
codec. The camera is pumping video data into system memory which the
video codec should encode into an H264 stream.

So after every frame the camera hardware issues a P2P write into the BAR
of the video codec to signal that the frame is completed and it can
start encoding. This is not even a memory write, but rather just some
trigger event.

That's essentially the background why I think having struct pages and
sg_tables doesn't make any sense at all for this use case.

>> Would you mind if I start to tackle this problem?
> Yes, I've been waiting forever for someone to tackle how the scatterlist
> is abused in dma-buf..

How about we separate the scatterlist into page and DMA address
container?

Regards,
Christian.
On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
>
> drm folks made up their own weird rules; if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has no business anywhere
> else in the kernel. And yes, there probably is a reason or two why
> the drm code is unusually error prone.

That may be, but it is creating problems if DRM gets to do X crazy
thing and nobody else can..

So, we have two issues here

1) DMABUF abuses the scatter list, but this is very constrained: we have
   this weird special "DMABUF scatterlist" that is only touched by DMABUF
   importers. The importers signal that they understand the format with
   a flag. This is ugly and would be nice to clean to a dma mapped
   address list of some sort.

   I spent a lot of time a few years ago removing driver touches of
   the SGL and preparing the RDMA stack to do this kind of change, at
   least.

2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
   certain special cases.

   Rather than jump to ZONE_DEVICE and map_sgl I would like to continue
   to support non-struct page mapping. So, I would suggest adding a
   dma_map_p2p() that can cover off the special cases, including the two
   struct devices as arguments with a physical PFN/size. Do the same
   logic we do under the ZONE_DEVICE stuff.

   Drivers can choose if they want to pay the memory cost of ZONE_DEVICE
   and get faster dma_map or get slower dma_map and save memory.

I still think we can address them incrementally - but the dma_map_p2p()
might be small enough to sort out right away, if you are OK with it.

> > > Why would small BARs be problematic for the pages? The pages are more
> > > a problem for gigantic BARs due to the memory overhead.
> >
> > How do I get a struct page * for a 4k BAR in vfio?
>
> I guess we have different definitions of small then :)
>
> But unless my understanding of the code is out of date,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size. Adding Dan for comments.

Don't we need the virtual start address to equal the physical pfn for
everything to work properly? eg pfn_to_page?

And we can't over-allocate because another driver might want to also
use ZONE_DEVICE pages for its BAR that is now creating a collision.

So, at least as is, the memmap stuff seems unable to support the case
we have with VFIO.

> That being said, what is the point of mapping say a 4k BAR for p2p?
> You're not going to save a measurable amount of CPU overhead if that
> is the only place you transfer to.

For the purpose this series is chasing, it is for doorbell rings. The
actual data transfer may still bounce through CPU memory (if a CMB is
not available), but the latency reduction of directly signaling the
peer device that the transfer is ready is the key objective.

Bouncing an interrupt through the CPU to cause it to do a writel() is
very time consuming, especially on slow ARM devices, while we have
adequate memory bandwidth for data transfer.

When I look at iommufd, it is for generality and compat. We don't have
knowledge of what the guest will do, so regardless of BAR size we have
to create P2P iommu mappings for every kind of PCI BAR. It is what
vfio is currently doing.

Jason
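The dma_map_p2p() Jason proposes above is only sketched in prose; a declaration matching his description (two struct devices plus a physical address and size, no struct page) might look like this. The signature is entirely hypothetical - no such function exists in the kernel at this point:

```c
/* Hypothetical API, following Jason's description above: map 'phys' of
 * the BAR owned by 'provider' for DMA issued by 'initiator', handling
 * the bus-address, host-bridge, and PCI-offset cases internally. */
dma_addr_t dma_map_p2p(struct device *initiator, struct pci_dev *provider,
		       phys_addr_t phys, size_t size,
		       enum dma_data_direction dir, unsigned long attrs);
```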
On Wed, Sep 07, 2022 at 05:08:37PM +0200, Christian König wrote:
> The key point is that you can't do any CPU fallback with this as long as
> the CPU wouldn't do exactly the same thing as the original hardware
> device. E.g. not write combine nor do any full page copies etc...

That is true for MMIO in general. You need to use the mmio accessors,
and even then it needs to match what the hardware expects.

> This is not even a memory write, but rather just some trigger event.
> That's essentially the background why I think having struct pages and
> sg_tables doesn't make any sense at all for this use case.

Well, Jason's patch uses a sg_table, just in a very broken way.

> > > Would you mind if I start to tackle this problem?
> > Yes, I've been waiting forever for someone to tackle how the scatterlist
> > is abused in dma-buf..
>
> How about we separate the scatterlist into page and DMA address
> container?

As said before, that is fundamentally the right thing to do. It just
takes a lot of work. I think on the dma mapping side we're finally
getting ready for it now that arm uses the common dma-direct code,
although getting rid of the arm iommu ops first would be even better.

Note that this only solves the problem of needing the pages to hold the
physical address. The other thing the pages are used for in p2p is that
(through the pgmap) they point to the owning device of the p2p memory
and thus allow figuring out how it needs to be accessed.
On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> 2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
>    certain special cases.

Not just certain special cases, but one of the main use cases.
Basically P2P can happen in two ways:

 a) through a PCIe switch, or
 b) through connected root ports

The open-coded version here only supports a), and only supports it when
there is no offset between the 'physical' address of the BAR as seen by
the PCIe endpoint and as seen by Linux. x86 usually (always?) doesn't
have an offset there, but other architectures often do.

Last but not least I don't really see how the code would even work
when an IOMMU is used, as dma_map_resource will return an IOVA that
is only understood by the IOMMU itself, and not the other endpoint.

How was this code even tested?
On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> > 2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> >    certain special cases.
>
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
>  a) through a PCIe switch, or
>  b) through connected root ports

Yes, we tested both, both work.

> The open-coded version here only supports a), and only supports it when
> there is no offset between the 'physical' address of the BAR as seen by
> the PCIe endpoint and as seen by Linux. x86 usually (always?) doesn't
> have an offset there, but other architectures often do.

The PCI offset is some embedded thing - I've never seen it in a server
platform. Frankly, it is just bad SOC design and there is good reason
why non-zero needs to be avoided. As soon as you create aliases between
the address spaces you invite trouble.

IIRC a SOC I used once put the memory at 0 -> 4G then put the only PCI
aperture at 4G -> 4G+N. However this design requires 64 bit PCI
support, which at the time, the platform didn't have. So they used PCI
offset to hackily alias the aperture over the DDR. I don't remember if
they threw out a bit of DDR to resolve the alias, or if they just
didn't support PCI switches.

In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches or lose a lot of DDR.

I also seem to remember that iommu and PCI offset don't play nice
together - so for the VFIO use case where the iommu is present I'm
pretty sure we can very safely assume 0 offset. That seems confirmed
by the fact that VFIO has never handled PCI offset in its own P2P path
and P2P works fine in VMs across a wide range of platforms.

That said, I agree we should really have APIs that support this
properly, and dma_map_resource is certainly technically wrong.

So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?

> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.

I don't understand this.

__iommu_dma_map() will put the given phys into the iommu_domain
associated with 'dev' and return the IOVA it picked.

Here 'dev' is the importing device, it is the device that will issue
the DMA:

+		dma_addr = dma_map_resource(
+			attachment->dev,
+			pci_resource_start(priv->vdev->pdev, priv->index) +
+				priv->offset,
+			priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

eg attachment->dev is the PCI device of the RDMA device, not the VFIO
device. 'phys' is the CPU physical of the PCI BAR page, which with 0
PCI offset is the right thing to program into the IO page table.

> How was this code even tested?

It was tested on a few platforms, like I said above, the cases where it
doesn't work are special, largely embedded, and not anything we have in
our labs - AFAIK.

Jason
On 2022-09-07 16:23, Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
>> drm folks made up their own weird rules; if they internally stick
>> to it they have to listen to it given that they ignore review comments,
>> but it violates the scatterlist API and has no business anywhere
>> else in the kernel. And yes, there probably is a reason or two why
>> the drm code is unusually error prone.
>
> That may be, but it is creating problems if DRM gets to do X crazy
> thing and nobody else can..
>
> So, we have two issues here
>
> 1) DMABUF abuses the scatter list, but this is very constrained: we have
>    this weird special "DMABUF scatterlist" that is only touched by DMABUF
>    importers. The importers signal that they understand the format with
>    a flag. This is ugly and would be nice to clean to a dma mapped
>    address list of some sort.
>
>    I spent a lot of time a few years ago removing driver touches of
>    the SGL and preparing the RDMA stack to do this kind of change, at
>    least.
>
> 2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
>    certain special cases.

FWIW, dma_map_resource() *is* for P2P in general. The classic case of
one device poking at another's registers that was the original
motivation is a standalone DMA engine reading/writing a peripheral
device's FIFO, so the very similar inter-device doorbell signal is
absolutely in scope too; VRAM might be a slightly greyer area, but if
it's still not page-backed kernel memory then I reckon that's fair
game.

The only trouble is that it's not geared for *PCI* P2P when that may or
may not happen entirely upstream of IOMMU translation.

Robin.
On Wed, Sep 07, 2022 at 05:31:14PM +0100, Robin Murphy wrote:
> The only trouble is that it's not geared for *PCI* P2P when that may or
> may not happen entirely upstream of IOMMU translation.

This is why PCI users have to call the pci_distance stuff before using
dma_map_resource(); it ensures the PCI fabric is set up in a way that
is consistent with the iommu. eg if we have IOMMU turned on then the
fabric must have ACS/etc to ensure that all TLPs are translated.

PCI P2P is very complicated and fragile, sadly.

Thanks,
Jason
Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > How do I get a struct page * for a 4k BAR in vfio?
>
> I guess we have different definitions of small then :)
>
> But unless my understanding of the code is out of date,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size. Adding Dan for comments.

The minimum granularity for sparse_add_section() that memremap_pages
uses internally is 2MB, so start and end need to be 2MB aligned.
Details here:

ba72b4c8cf60 mm/sparsemem: support sub-section hotplug
On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
> The PCI offset is some embedded thing - I've never seen it in a server
> platform.

That's not actually true, e.g. some power systems definitely had it,
although I don't know if the current ones do. But that's not the point.
The offset is a configuration fully supported by Linux, and one that
just works when the proper APIs are used. Doing some handwaving about
embedded-only or bad design doesn't matter. There is a reason why we
have these proper APIs and no one has any business bypassing them.

> I also seem to remember that iommu and PCI offset don't play nice
> together - so for the VFIO use case where the iommu is present I'm
> pretty sure we can very safely assume 0 offset. That seems confirmed
> by the fact that VFIO has never handled PCI offset in its own P2P path
> and P2P works fine in VMs across a wide range of platforms.

I think the offset is one of the reasons why IOVA windows can be
reserved (and maybe also why ppc is so weird).

> So, would you be OK with this series if I try to make a dma_map_p2p()
> that resolves the offset issue?

Well, if it also solves the other issue of invalid scatterlists leaking
outside of drm we can think about it.

> > Last but not least I don't really see how the code would even work
> > when an IOMMU is used, as dma_map_resource will return an IOVA that
> > is only understood by the IOMMU itself, and not the other endpoint.
>
> I don't understand this.
>
> __iommu_dma_map() will put the given phys into the iommu_domain
> associated with 'dev' and return the IOVA it picked.

Yes, __iommu_dma_map creates an IOVA for the mapped remote BAR. That
is the right thing if the I/O goes through the host bridge, but it is
the wrong thing if the I/O goes through the switch - in that case the
IOVA generated is not something that the endpoint that owns the BAR
can even understand.

Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how
this is handled.
On Fri, Sep 09, 2022 at 06:24:35AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
> > The PCI offset is some embedded thing - I've never seen it in a server
> > platform.
>
> That's not actually true, e.g. some power systems definitely had it,
> although I don't know if the current ones do.

I thought those were all power embedded systems.

> There is a reason why we have these proper APIs and no one has any
> business bypassing them.

Yes, we should try to support these things, but you said this patch
didn't work and wasn't tested - that is not true at all.

And it isn't like we have APIs just sitting here to solve this specific
problem.  So let's make something.

> > So, would you be OK with this series if I try to make a dma_map_p2p()
> > that resolves the offset issue?
>
> Well, if it also solves the other issue of invalid scatterlists leaking
> outside of drm we can think about it.

The scatterlist stuff has already leaked outside of DRM anyhow.  Again,
I think it is very problematic to let DRM get away with things and then
insist all the poor non-DRM people be responsible to clean up their
mess.

I'm skeptical I can fix AMD GPU, but I can try to create a DMABUF op
that returns something that is not a scatterlist and teach RDMA to use
it.  So at least the VFIO/RDMA part can avoid the scatterlist abuse.  I
expected to need non-scatterlist for iommufd anyhow.

Coupled with a series to add some dma_map_resource_pci() that handles
the PCI_P2PDMA_MAP_BUS_ADDR and the PCI offset, would it be an
agreeable direction?

> Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how
> this is handled.

So there is a bug in all these DMABUF implementations: they ignore the
PCI_P2PDMA_MAP_BUS_ADDR "distance type".

This isn't a real-world problem for VFIO because VFIO is largely
incompatible with the non-ACS configuration that would trigger
PCI_P2PDMA_MAP_BUS_ADDR, which explains why we never saw any problem.
All our systems have ACS turned on so we can use VFIO.  I'm unclear how
Habana or AMD have avoided a problem here..  This is much more serious
than the pci offset in my mind.

Thanks,
Jason
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5a3..514c12a81127d6 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
new file mode 100644
index 00000000000000..11396aeabff405
--- /dev/null
+++ b/drivers/vfio/pci/dma_buf.c
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS(DMA_BUF);
+
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	unsigned int index;
+	unsigned int orig_nents;
+	size_t offset;
+	bool revoked;
+};
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+	int rc;
+
+	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
+				      true);
+	if (rc < 0)
+		attachment->peer2peer = false;
+	return 0;
+}
+
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
+{
+}
+
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
+{
+	/*
+	 * Uses the dynamic interface but must always allow for
+	 * dma_buf_move_notify() to do revoke
+	 */
+	return -EINVAL;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	size_t sgl_size = dma_get_max_seg_size(attachment->dev);
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	dma_addr_t dma_addr;
+	unsigned int nents;
+	size_t offset;
+	int ret;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	if (!attachment->peer2peer)
+		return ERR_PTR(-EPERM);
+
+	if (priv->revoked)
+		return ERR_PTR(-ENODEV);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
+	ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
+	if (ret)
+		goto err_kfree_sgt;
+
+	/*
+	 * Since the memory being mapped is a device memory it could never be in
+	 * CPU caches.
+	 */
+	dma_addr = dma_map_resource(
+		attachment->dev,
+		pci_resource_start(priv->vdev->pdev, priv->index) +
+			priv->offset,
+		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	ret = dma_mapping_error(attachment->dev, dma_addr);
+	if (ret)
+		goto err_free_sgt;
+
+	/*
+	 * Break the BAR's physical range up into max sized SGL's according to
+	 * the device's requirement.
+	 */
+	sgl = sgt->sgl;
+	for (offset = 0; offset != priv->dmabuf->size;) {
+		size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size);
+
+		sg_set_page(sgl, NULL, chunk_size, 0);
+		sg_dma_address(sgl) = dma_addr + offset;
+		sg_dma_len(sgl) = chunk_size;
+		sgl = sg_next(sgl);
+		offset += chunk_size;
+	}
+
+	/*
+	 * Because we are not going to include a CPU list we want to have some
+	 * chance that other users will detect this by setting the orig_nents to
+	 * 0 and using only nents (length of DMA list) when going over the sgl
+	 */
+	priv->orig_nents = sgt->orig_nents;
+	sgt->orig_nents = 0;
+	return sgt;
+
+err_free_sgt:
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+
+	sgt->orig_nents = priv->orig_nents;
+	dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl),
+			   priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	/*
+	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+	 * The refcount prevents both.
+	 */
+	if (priv->vdev) {
+		down_write(&priv->vdev->memory_lock);
+		list_del_init(&priv->dmabufs_elm);
+		up_write(&priv->vdev->memory_lock);
+		vfio_device_put(&priv->vdev->vdev);
+	}
+	kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+	.attach = vfio_pci_dma_buf_attach,
+	.map_dma_buf = vfio_pci_dma_buf_map,
+	.pin = vfio_pci_dma_buf_pin,
+	.unpin = vfio_pci_dma_buf_unpin,
+	.release = vfio_pci_dma_buf_release,
+	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz)
+{
+	struct vfio_device_feature_dma_buf get_dma_buf;
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct vfio_pci_dma_buf *priv;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(get_dma_buf));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+		return -EFAULT;
+
+	/* For PCI the region_index is the BAR number like everything else */
+	if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index);
+	if (!exp_info.size)
+		return -EINVAL;
+	if (get_dma_buf.offset || get_dma_buf.length) {
+		if (get_dma_buf.length > exp_info.size ||
+		    get_dma_buf.offset >= exp_info.size ||
+		    get_dma_buf.length > exp_info.size - get_dma_buf.offset ||
+		    get_dma_buf.offset % PAGE_SIZE ||
+		    get_dma_buf.length % PAGE_SIZE)
+			return -EINVAL;
+		exp_info.size = get_dma_buf.length;
+	}
+	exp_info.flags = get_dma_buf.open_flags;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	priv->offset = get_dma_buf.offset;
+	priv->index = get_dma_buf.region_index;
+
+	exp_info.priv = priv;
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		ret = PTR_ERR(priv->dmabuf);
+		kfree(priv);
+		return ret;
+	}
+
+	/* dma_buf_put() now frees priv */
+
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	priv->vdev = vdev;
+	vfio_device_get(&vdev->vdev);
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+
+	/*
+	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+	 * will be released.
+	 */
+	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
+			continue;
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
+	}
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	down_write(&vdev->memory_lock);
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
+			continue;
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		list_del_init(&priv->dmabufs_elm);
+		priv->vdev = NULL;
+		priv->revoked = true;
+		dma_buf_move_notify(priv->dmabuf);
+		dma_resv_unlock(priv->dmabuf->resv);
+		vfio_device_put(&vdev->vdev);
+		dma_buf_put(priv->dmabuf);
+	}
+	up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d22921efa25987..f8a9c24d04aeb7 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -584,10 +584,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 	virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 	new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-	if (!new_mem)
+	if (!new_mem) {
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
-	else
+		vfio_pci_dma_buf_move(vdev, true);
+	} else {
 		down_write(&vdev->memory_lock);
+	}
 
 	/*
 	 * If the user is writing mem/io enable (new_mem/io) and we
@@ -622,6 +624,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 	*virt_cmd &= cpu_to_le16(~mask);
 	*virt_cmd |= cpu_to_le16(new_cmd & mask);
 
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 }
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 885706944e4d96..7e27bfc60440c2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -504,6 +504,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 	vfio_spapr_pci_eeh_release(vdev->pdev);
 	vfio_pci_core_disable(vdev);
 
+	vfio_pci_dma_buf_cleanup(vdev);
+
 	mutex_lock(&vdev->igate);
 	if (vdev->err_trigger) {
 		eventfd_ctx_put(vdev->err_trigger);
@@ -980,7 +982,10 @@ int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev)
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
+	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1210,11 +1215,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-				       uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+				       u32 flags, uuid_t __user *arg,
+				       size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	uuid_t uuid;
 	int ret;
 
@@ -1241,9 +1245,14 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF:
+		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -1881,6 +1890,7 @@ void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev,
 	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
+	INIT_LIST_HEAD(&vdev->dmabufs);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_init_device);
 
@@ -2227,11 +2237,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		vfio_pci_dma_buf_move(cur, true);
 		vfio_pci_set_power_state(cur, PCI_D0);
+	}
 
 	ret = pci_reset_bus(pdev);
 
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		if (__vfio_pci_memory_enabled(cur))
+			vfio_pci_dma_buf_move(cur, false);
+
 err_undo:
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
 		if (cur == cur_mem)
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5b1cb9a9838442..c295a1166e7a9c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -102,4 +102,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#ifdef CONFIG_DMA_SHARED_BUFFER
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+			      struct vfio_device_feature_dma_buf __user *arg,
+			      size_t argsz)
+{
+	return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+					 bool revoked)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index e5cf0d3313a694..fb9769d9e5d149 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -93,6 +93,7 @@ struct vfio_pci_core_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30a5..1dcfad6dcb6863 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,24 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * region selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * If both are 0 then the whole region is used.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+struct vfio_device_feature_dma_buf {
+	__u32 region_index;
+	__u32 open_flags;
+	__u32 offset;
+	__u64 length;
+};
+#define VFIO_DEVICE_FEATURE_DMA_BUF 3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

The patch design loosely follows the pattern in commit db1a8dd916aa
("habanalabs: add support for dma-buf exporter") except this does not
support pinning.

Instead, this implements what, in the past, we've called a revocable
attachment using move. In normal situations the attachment is pinned, as
a BAR does not change physical address. However when the VFIO device is
closed, or a PCI reset is issued, access to the MMIO memory is revoked.

Revoked means that move occurs, but an attempt to immediately re-map the
memory will fail. In the reset case a future move will be triggered when
MMIO access returns. As both close and reset are under userspace control
it is expected that userspace will suspend use of the dma-buf before
doing these operations, the revoke is purely for kernel self-defense
against a hostile userspace.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 269 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
 drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++
 7 files changed, 340 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c