Message ID | 20240313105804.100168-10-cassel@kernel.org
---|---
State | Superseded
Series | PCI: endpoint: set prefetchable bit for 64-bit BARs
+ Arnd

On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> From the PCIe 6.0 base spec:

It'd be good to mention the section also.

> "Generally only 64-bit BARs are good candidates, since only Legacy
> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> and most scalable platforms map all 32-bit Memory BARs into
> non-prefetchable Memory Space regardless of the Prefetchable bit value."
>
> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> for all BARs that do not have the Prefetchable bit Set."
>
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
>
> The PCIe spec tells us that we should have the prefetchable bit set for
> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> dma_alloc_coherent()) is obviously "normal memory".
>

I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is
not 'normal memory' but rather 'consistent/coherent memory'. Here the
question is: can the memory returned by dma_alloc_coherent() be
prefetched or write-combined on all architectures?

I hope Arnd can answer this question.

- Mani

> Thus, set the prefetchable bit when allocating backing memory for a 64-bit
> BAR.
>
> Signed-off-by: Niklas Cassel <cassel@kernel.org>
> ---
>  drivers/pci/endpoint/pci-epf-core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/endpoint/pci-epf-core.c b/drivers/pci/endpoint/pci-epf-core.c
> index e7dbbeb1f0de..20d2bde0747c 100644
> --- a/drivers/pci/endpoint/pci-epf-core.c
> +++ b/drivers/pci/endpoint/pci-epf-core.c
> @@ -309,6 +309,9 @@ void *pci_epf_alloc_space(struct pci_epf *epf, size_t size, enum pci_barno bar,
>  	else
>  		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_TYPE_32;
>
> +	if (epf_bar[bar].flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
>  	return space;
> }
> EXPORT_SYMBOL_GPL(pci_epf_alloc_space);
> --
> 2.44.0
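[For orientation, a minimal sketch of how a function driver ends up in the
code path the patch touches. The helper name is hypothetical; the
signatures are the v6.8 ones, where pci_epf_alloc_space() still takes an
alignment argument. pci-epf-test does essentially this.]

#include <linux/pci-epc.h>
#include <linux/pci-epf.h>

/*
 * Hypothetical helper, modeled on pci-epf-test in v6.8:
 * pci_epf_alloc_space() allocates the BAR backing memory with
 * dma_alloc_coherent() and fills in epf->bar[bar]; pci_epc_set_bar()
 * then programs the controller (e.g. the DWC iATU) so the host can
 * reach that memory through the BAR.
 */
static int epf_setup_bar(struct pci_epf *epf, enum pci_barno bar,
			 size_t size, size_t align)
{
	void *base;

	base = pci_epf_alloc_space(epf, size, bar, align, PRIMARY_INTERFACE);
	if (!base)
		return -ENOMEM;

	/*
	 * With the patch above, a 64-bit BAR would now also carry
	 * PCI_BASE_ADDRESS_MEM_PREFETCH in epf->bar[bar].flags.
	 */
	return pci_epc_set_bar(epf->epc, epf->func_no, epf->vfunc_no,
			       &epf->bar[bar]);
}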
On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
>> "Generally only 64-bit BARs are good candidates, since only Legacy
>> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
>> and most scalable platforms map all 32-bit Memory BARs into
>> non-prefetchable Memory Space regardless of the Prefetchable bit value."
>>
>> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
>> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
>> for all BARs that do not have the Prefetchable bit Set."
>>
>> "Any device that has a range that behaves like normal memory should mark
>> the range as prefetchable. A linear frame buffer in a graphics device is
>> an example of a range that should be marked prefetchable."
>>
>> The PCIe spec tells us that we should have the prefetchable bit set for
>> 64-bit BARs backed by "normal memory". The backing memory that we allocate
>> for a 64-bit BAR using pci_epf_alloc_space() (which calls
>> dma_alloc_coherent()) is obviously "normal memory".
>>
>
> I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is
> not 'normal memory' but rather 'consistent/coherent memory'. Here the
> question is: can the memory returned by dma_alloc_coherent() be
> prefetched or write-combined on all architectures?
>
> I hope Arnd can answer this question.

I think there are three separate questions here when talking about
a scenario where a PCI master accesses memory behind a PCI endpoint:

- The CPU on the host side usually uses ioremap() for mapping
the PCI BAR of the device. If the BAR is marked as prefetchable,
we usually allow mapping it using ioremap_wc() for write-combining
or ioremap_wt() for write-through mappings that allow both
write-combining and prefetching. On some architectures, these
all fall back to normal register mappings which do none of these.
If it uses write-combining or prefetching, the host side driver
will need to manually serialize against concurrent access from
the endpoint side.

- The endpoint device accessing a buffer in memory is controlled
by the endpoint driver and may decide to prefetch data into a
local cache independent of the other two. I don't know if any
of the supported endpoint devices actually do that. A prefetch
from the PCI host side would appear as a normal transaction here.

- The local CPU on the endpoint side may access the same buffer as
the endpoint device. On low-end SoCs the DMA from the PCI
endpoint is not coherent with the CPU caches, so the CPU may
need to map it as uncacheable to allow data consistency with
the CPU on the PCI host side. On higher-end SoCs (e.g. most
non-ARM ones) DMA is coherent with the caches, so the CPU
on the endpoint side may map the buffer as cached and
still be coherent with a CPU on the PCI host side that has
mapped it with ioremap().

Arnd
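[To make Arnd's first point concrete, a sketch of the host (RC) side
mapping decision; the helper is hypothetical, the kernel APIs are the
standard ones.]

#include <linux/io.h>
#include <linux/pci.h>

/*
 * Illustration only: pick a write-combining mapping when the BAR is
 * advertised as prefetchable, otherwise a strictly ordered mapping.
 */
static void __iomem *map_bar(struct pci_dev *pdev, int bar)
{
	if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
		/* Prefetchable: write-combining is permitted */
		return ioremap_wc(pci_resource_start(pdev, bar),
				  pci_resource_len(pdev, bar));

	/* Non-prefetchable: normal register-style mapping */
	return pci_ioremap_bar(pdev, bar);
}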
Hello all,

On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >> "Generally only 64-bit BARs are good candidates, since only Legacy
> >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> >> and most scalable platforms map all 32-bit Memory BARs into
> >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> >>
> >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> >> for all BARs that do not have the Prefetchable bit Set."
> >>
> >> "Any device that has a range that behaves like normal memory should mark
> >> the range as prefetchable. A linear frame buffer in a graphics device is
> >> an example of a range that should be marked prefetchable."
> >>
> >> The PCIe spec tells us that we should have the prefetchable bit set for
> >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> >> dma_alloc_coherent()) is obviously "normal memory".
> >>
> >
> > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is
> > not 'normal memory' but rather 'consistent/coherent memory'. Here the
> > question is: can the memory returned by dma_alloc_coherent() be
> > prefetched or write-combined on all architectures?
> >
> > I hope Arnd can answer this question.
>
> I think there are three separate questions here when talking about
> a scenario where a PCI master accesses memory behind a PCI endpoint:

I think the question is if the PCI epf-core, which runs on the endpoint
side, and which calls dma_alloc_coherent() to allocate backing memory for
a BAR, can set/mark the Prefetchable bit for the BAR (if we also set/mark
the BAR as a 64-bit BAR).

The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
states:
"Any device that has a range that behaves like normal memory should mark
the range as prefetchable. A linear frame buffer in a graphics device is
an example of a range that should be marked prefetchable."

Doesn't the backing memory allocated for a specific BAR using
dma_alloc_coherent() on the EP side behave like normal memory from the
host's point of view?

On the host side, this will mean that the host driver sees the
Prefetchable bit, and, according to:
https://docs.kernel.org/driver-api/device-io.html
the host might map the BAR using ioremap_wc().

Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
BARs using pci_ioremap_bar():
https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
which will not map it using ioremap_wc().
(But the code we have in the PCI epf-core must of course work with host
side drivers other than pci_endpoint_test.c as well.)

> - The CPU on the host side usually uses ioremap() for mapping
> the PCI BAR of the device. If the BAR is marked as prefetchable,
> we usually allow mapping it using ioremap_wc() for write-combining
> or ioremap_wt() for write-through mappings that allow both
> write-combining and prefetching. On some architectures, these
> all fall back to normal register mappings which do none of these.
> If it uses write-combining or prefetching, the host side driver
> will need to manually serialize against concurrent access from
> the endpoint side.
>
> - The endpoint device accessing a buffer in memory is controlled
> by the endpoint driver and may decide to prefetch data into a
> local cache independent of the other two. I don't know if any
> of the supported endpoint devices actually do that. A prefetch
> from the PCI host side would appear as a normal transaction here.
>
> - The local CPU on the endpoint side may access the same buffer as
> the endpoint device. On low-end SoCs the DMA from the PCI
> endpoint is not coherent with the CPU caches, so the CPU may

I don't follow. When doing DMA *from* the endpoint, the DMA HW
on the EP side will read or write data to a buffer allocated on the
host side (most likely using dma_alloc_coherent()), but what does
that have to do with how the EP configures the BARs that it exposes?

> need to map it as uncacheable to allow data consistency with
> the CPU on the PCI host side. On higher-end SoCs (e.g. most
> non-ARM ones) DMA is coherent with the caches, so the CPU
> on the endpoint side may map the buffer as cached and
> still be coherent with a CPU on the PCI host side that has
> mapped it with ioremap().

Kind regards,
Niklas
On Sun, Mar 17, 2024 at 12:54:11PM +0100, Niklas Cassel wrote:
> Hello all,
>
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> > On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> > >> "Generally only 64-bit BARs are good candidates, since only Legacy
> > >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> > >> and most scalable platforms map all 32-bit Memory BARs into
> > >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> > >>
> > >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> > >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> > >> for all BARs that do not have the Prefetchable bit Set."
> > >>
> > >> "Any device that has a range that behaves like normal memory should mark
> > >> the range as prefetchable. A linear frame buffer in a graphics device is
> > >> an example of a range that should be marked prefetchable."
> > >>
> > >> The PCIe spec tells us that we should have the prefetchable bit set for
> > >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> > >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> > >> dma_alloc_coherent()) is obviously "normal memory".
> > >>
> > >
> > > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is
> > > not 'normal memory' but rather 'consistent/coherent memory'. Here the
> > > question is: can the memory returned by dma_alloc_coherent() be
> > > prefetched or write-combined on all architectures?
> > >
> > > I hope Arnd can answer this question.
> >
> > I think there are three separate questions here when talking about
> > a scenario where a PCI master accesses memory behind a PCI endpoint:
>
> I think the question is if the PCI epf-core, which runs on the endpoint
> side, and which calls dma_alloc_coherent() to allocate backing memory for
> a BAR, can set/mark the Prefetchable bit for the BAR (if we also set/mark
> the BAR as a 64-bit BAR).
>
> The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
> states:
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
>
> Doesn't the backing memory allocated for a specific BAR using
> dma_alloc_coherent() on the EP side behave like normal memory from the
> host's point of view?
>
> On the host side, this will mean that the host driver sees the
> Prefetchable bit, and, according to:
> https://docs.kernel.org/driver-api/device-io.html
> the host might map the BAR using ioremap_wc().
>
> Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
> BARs using pci_ioremap_bar():
> https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
> which will not map it using ioremap_wc().
> (But the code we have in the PCI epf-core must of course work with host
> side drivers other than pci_endpoint_test.c as well.)
>

Right. I don't see any problem with the host side assumption. But my
question is: is it OK to advertise the coherent memory allocated on the
endpoint as prefetchable to the host? As you quoted the spec, "Any device
that has a range that behaves like normal memory should mark the range as
prefetchable." Here, the coherent memory allocated by the device
(endpoint) won't behave as normal memory on the _endpoint_.

But I'm certainly not sure if there are any implications in exposing this
memory as 'normal memory' to the host.

- Mani

> > - The CPU on the host side usually uses ioremap() for mapping
> > the PCI BAR of the device. If the BAR is marked as prefetchable,
> > we usually allow mapping it using ioremap_wc() for write-combining
> > or ioremap_wt() for write-through mappings that allow both
> > write-combining and prefetching. On some architectures, these
> > all fall back to normal register mappings which do none of these.
> > If it uses write-combining or prefetching, the host side driver
> > will need to manually serialize against concurrent access from
> > the endpoint side.
> >
> > - The endpoint device accessing a buffer in memory is controlled
> > by the endpoint driver and may decide to prefetch data into a
> > local cache independent of the other two. I don't know if any
> > of the supported endpoint devices actually do that. A prefetch
> > from the PCI host side would appear as a normal transaction here.
> >
> > - The local CPU on the endpoint side may access the same buffer as
> > the endpoint device. On low-end SoCs the DMA from the PCI
> > endpoint is not coherent with the CPU caches, so the CPU may
>
> I don't follow. When doing DMA *from* the endpoint, the DMA HW
> on the EP side will read or write data to a buffer allocated on the
> host side (most likely using dma_alloc_coherent()), but what does
> that have to do with how the EP configures the BARs that it exposes?
>
> > need to map it as uncacheable to allow data consistency with
> > the CPU on the PCI host side. On higher-end SoCs (e.g. most
> > non-ARM ones) DMA is coherent with the caches, so the CPU
> > on the endpoint side may map the buffer as cached and
> > still be coherent with a CPU on the PCI host side that has
> > mapped it with ioremap().
>
> Kind regards,
> Niklas
On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >> "Generally only 64-bit BARs are good candidates, since only Legacy
> >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> >> and most scalable platforms map all 32-bit Memory BARs into
> >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> >>
> >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> >> for all BARs that do not have the Prefetchable bit Set."
> >>
> >> "Any device that has a range that behaves like normal memory should mark
> >> the range as prefetchable. A linear frame buffer in a graphics device is
> >> an example of a range that should be marked prefetchable."
> >>
> >> The PCIe spec tells us that we should have the prefetchable bit set for
> >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> >> dma_alloc_coherent()) is obviously "normal memory".
> >>
> >
> > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is
> > not 'normal memory' but rather 'consistent/coherent memory'. Here the
> > question is: can the memory returned by dma_alloc_coherent() be
> > prefetched or write-combined on all architectures?
> >
> > I hope Arnd can answer this question.
>
> I think there are three separate questions here when talking about
> a scenario where a PCI master accesses memory behind a PCI endpoint:
>
> - The CPU on the host side usually uses ioremap() for mapping
> the PCI BAR of the device. If the BAR is marked as prefetchable,
> we usually allow mapping it using ioremap_wc() for write-combining
> or ioremap_wt() for write-through mappings that allow both
> write-combining and prefetching. On some architectures, these
> all fall back to normal register mappings which do none of these.
> If it uses write-combining or prefetching, the host side driver
> will need to manually serialize against concurrent access from
> the endpoint side.
>
> - The endpoint device accessing a buffer in memory is controlled
> by the endpoint driver and may decide to prefetch data into a
> local cache independent of the other two. I don't know if any
> of the supported endpoint devices actually do that. A prefetch
> from the PCI host side would appear as a normal transaction here.
>
> - The local CPU on the endpoint side may access the same buffer as
> the endpoint device. On low-end SoCs the DMA from the PCI
> endpoint is not coherent with the CPU caches, so the CPU may
> need to map it as uncacheable to allow data consistency with
> the CPU on the PCI host side. On higher-end SoCs (e.g. most
> non-ARM ones) DMA is coherent with the caches, so the CPU
> on the endpoint side may map the buffer as cached and
> still be coherent with a CPU on the PCI host side that has
> mapped it with ioremap().
>

Thanks Arnd for the reply. But I'm not sure I got the answer I was looking
for. So let me rephrase my question a bit.

For BAR memory, the PCIe spec states that,

'A PCI Express Function requesting Memory Space through a BAR must set the
BAR's Prefetchable bit unless the range contains locations with read side
effects or locations in which the Function does not tolerate write merging'

So here, the spec refers to the backing memory allocated on the endpoint
side as the 'range', i.e., the BAR memory allocated on the host that gets
mapped on the endpoint.

Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
memory for each BAR and map it using the iATU.

So I want to know if the memory range allocated in the endpoint through
dma_alloc_coherent() satisfies the above two conditions in the PCIe spec on
all architectures:

1. No Read side effects
2. Tolerates write merging

I believe the reason why we are allocating the coherent memory on the
endpoint first up is that not all PCIe controllers are DMA coherent, as you
said above.

- Mani
On Mon, Mar 18, 2024, at 05:30, Manivannan Sadhasivam wrote:
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
>> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
>> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
>
> But I'm not sure I got the answer I was looking for. So let me rephrase my
> question a bit.
>
> For BAR memory, the PCIe spec states that,
>
> 'A PCI Express Function requesting Memory Space through a BAR must set the
> BAR's Prefetchable bit unless the range contains locations with read side
> effects or locations in which the Function does not tolerate write merging'
>
> So here, the spec refers to the backing memory allocated on the endpoint
> side as the 'range', i.e., the BAR memory allocated on the host that gets
> mapped on the endpoint.
>
> Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
> memory for each BAR and map it using the iATU.
>
> So I want to know if the memory range allocated in the endpoint through
> dma_alloc_coherent() satisfies the above two conditions in the PCIe spec on
> all architectures:
>
> 1. No Read side effects
> 2. Tolerates write merging
>
> I believe the reason why we are allocating the coherent memory on the
> endpoint first up is that not all PCIe controllers are DMA coherent, as you
> said above.

As far as I can tell, we never have read side effects for memory
backed BARs, but the write merging is something that depends on
how the memory is used:

If you have anything in that memory that relies on ordering,
you probably want to map it as coherent on the endpoint side,
and non-prefetchable on the host controller side, and then
use the normal rmb()/wmb() barriers on both ends between
serialized accesses. An example of this would be having blocks
of data separate from metadata that says whether the data is
valid.

If you don't care about ordering on that level, I would use
dma_map_sg() on the endpoint side and prefetchable mapping on
the host side, with the endpoint using dma_sync_*() to pass
buffer ownership between the two sides, as controlled by some
other communication method (non-prefetchable BAR, MSI, ...).

Arnd
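[A sketch of the EP-side half of the streaming scheme Arnd suggests. All
names are hypothetical, and single-buffer dma_sync_single_*() calls are
shown instead of the sg variants for brevity.]

#include <linux/dma-mapping.h>

/*
 * Hedged sketch: "handle" is the DMA address of a streaming buffer that
 * is exposed to the host through a prefetchable BAR. Ownership is passed
 * explicitly with dma_sync_*() instead of using coherent memory.
 */
static void ep_take_buffer(struct device *dma_dev, dma_addr_t handle,
			   size_t len)
{
	/*
	 * The host signalled (non-prefetchable BAR write, MSI, ...) that
	 * the buffer is filled: give it back to the CPU before reading.
	 */
	dma_sync_single_for_cpu(dma_dev, handle, len, DMA_FROM_DEVICE);

	/* ... the local CPU reads/processes the buffer contents here ... */

	/* Return ownership before telling the host it may write again */
	dma_sync_single_for_device(dma_dev, handle, len, DMA_FROM_DEVICE);
}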
On Sun, Mar 17, 2024, at 12:54, Niklas Cassel wrote:
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
>> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
>>
>> I think there are three separate questions here when talking about
>> a scenario where a PCI master accesses memory behind a PCI endpoint:
>
> I think the question is if the PCI epf-core, which runs on the endpoint
> side, and which calls dma_alloc_coherent() to allocate backing memory for
> a BAR, can set/mark the Prefetchable bit for the BAR (if we also set/mark
> the BAR as a 64-bit BAR).
>
> The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
> states:
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
>
> Doesn't the backing memory allocated for a specific BAR using
> dma_alloc_coherent() on the EP side behave like normal memory from the
> host's point of view?

I'm not sure I follow this logic: If the device wants the
buffer to act like "normal memory", then it can be marked
as prefetchable and mapped into the host as write-combining,
but I think in this case you *don't* want it to be coherent
on the endpoint side either but use a streaming mapping with
explicit cache management instead.

Conversely, if the endpoint side requires a coherent mapping,
then I think you will want a strictly ordered (non-wc,
non-prefetchable) mapping on the host side as well.

It would be helpful to have actual endpoint function drivers
in the kernel rather than just the test drivers to see what type
of serialization you actually want for best performance on
both sides.

Can you give a specific example of an endpoint that you are
actually interested in, maybe just one that we have a host-side
device driver for in tree?

> On the host side, this will mean that the host driver sees the
> Prefetchable bit, and, according to:
> https://docs.kernel.org/driver-api/device-io.html
> the host might map the BAR using ioremap_wc().
>
> Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
> BARs using pci_ioremap_bar():
> https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
> which will not map it using ioremap_wc().
> (But the code we have in the PCI epf-core must of course work with host
> side drivers other than pci_endpoint_test.c as well.)

It is to some degree architecture specific here. On powerpc
and i386 with MTRRs, any prefetchable BAR will be mapped as
write-combining IIRC, but on arm and arm64 it only depends on
whether the host side driver uses ioremap() or ioremap_wc().

>> - The local CPU on the endpoint side may access the same buffer as
>> the endpoint device. On low-end SoCs the DMA from the PCI
>> endpoint is not coherent with the CPU caches, so the CPU may
>
> I don't follow. When doing DMA *from* the endpoint, the DMA HW
> on the EP side will read or write data to a buffer allocated on the
> host side (most likely using dma_alloc_coherent()), but what does
> that have to do with how the EP configures the BARs that it exposes?

I meant doing DMA to the memory of the endpoint side, not the
host side. DMA to the host side memory is completely separate
from this question.

Arnd
Hello Arnd,

On Mon, Mar 18, 2024 at 08:25:36AM +0100, Arnd Bergmann wrote:
>
> I'm not sure I follow this logic: If the device wants the
> buffer to act like "normal memory", then it can be marked
> as prefetchable and mapped into the host as write-combining,
> but I think in this case you *don't* want it to be coherent
> on the endpoint side either but use a streaming mapping with
> explicit cache management instead.
>
> Conversely, if the endpoint side requires a coherent mapping,
> then I think you will want a strictly ordered (non-wc,
> non-prefetchable) mapping on the host side as well.
>
> It would be helpful to have actual endpoint function drivers
> in the kernel rather than just the test drivers to see what type
> of serialization you actually want for best performance on
> both sides.

Yes, that would be nice.

This specific API, pci_epf_alloc_space(), is only used by the following
drivers:
drivers/pci/endpoint/functions/pci-epf-test.c
drivers/pci/endpoint/functions/pci-epf-ntb.c
drivers/pci/endpoint/functions/pci-epf-vntb.c

pci_epf_alloc_space() is only used to allocate backing memory for the
BARs.

>
> Can you give a specific example of an endpoint that you are
> actually interested in, maybe just one that we have a host-side
> device driver for in tree?

I personally just care about pci-epf-test, but obviously I don't
want to regress any other user of pci_epf_alloc_space().

Looking at the endpoint side driver:
drivers/pci/endpoint/functions/pci-epf-test.c
and the host side driver:
drivers/misc/pci_endpoint_test.c

On the RC side, allocating buffers that the EP will DMA to is
done using: kzalloc() + dma_map_single().

On the EP side, drivers/pci/endpoint/functions/pci-epf-test.c uses
dma_map_single() when using DMA, and signals completion using MSI.

On the EP side, when reading/writing to the BARs, it simply does
READ_ONCE()/WRITE_ONCE():
https://github.com/torvalds/linux/blob/v6.8/drivers/pci/endpoint/functions/pci-epf-test.c#L643-L648

There is no dma_sync(), so the pci-epf-test driver currently seems to
depend on the backing memory being allocated by dma_alloc_coherent().

> If you don't care about ordering on that level, I would use
> dma_map_sg() on the endpoint side and prefetchable mapping on
> the host side, with the endpoint using dma_sync_*() to pass
> buffer ownership between the two sides, as controlled by some
> other communication method (non-prefetchable BAR, MSI, ...).

I don't think there is any big reason why pci-epf-test is
implemented using dma_alloc_coherent() rather than dma_sync()
for the memory backing the BARs, but that is the way it is.

Since I don't feel like totally rewriting pci-epf-test, and since
you say that we shouldn't use dma_alloc_coherent() for the memory
backing the BARs together with exporting the BAR as prefetchable,
I will drop this patch from the series in the next revision.

Kind regards,
Niklas
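[The access pattern Niklas refers to looks roughly like the sketch below,
paraphrased from pci_epf_test_cmd_handler() in v6.8; the real struct
pci_epf_test_reg has more fields (addresses, size, checksum, IRQ info),
and the struct/function names here are simplified stand-ins.]

#include <linux/compiler.h>
#include <linux/types.h>

/* Trimmed-down stand-in for the register block in pci-epf-test */
struct epf_test_reg {
	u32 magic;
	u32 command;
	u32 status;
} __packed;

/*
 * "reg" points into BAR backing memory from dma_alloc_coherent(), which
 * the host pokes directly over PCIe, so plain READ_ONCE()/WRITE_ONCE()
 * accesses are enough and no dma_sync_*() calls are needed.
 */
static void epf_test_poll_command(struct epf_test_reg *reg)
{
	u32 command = READ_ONCE(reg->command);

	if (!command)
		return;

	WRITE_ONCE(reg->command, 0);
	WRITE_ONCE(reg->status, 0);

	/* ... dispatch: read/write/copy tests, raise IRQ to the host ... */
}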
On Mon, Mar 18, 2024, at 16:13, Niklas Cassel wrote:
> On Mon, Mar 18, 2024 at 08:25:36AM +0100, Arnd Bergmann wrote:
>
> I personally just care about pci-epf-test, but obviously I don't
> want to regress any other user of pci_epf_alloc_space().
>
> Looking at the endpoint side driver:
> drivers/pci/endpoint/functions/pci-epf-test.c
> and the host side driver:
> drivers/misc/pci_endpoint_test.c
>
> On the RC side, allocating buffers that the EP will DMA to is
> done using: kzalloc() + dma_map_single().
>
> On the EP side, drivers/pci/endpoint/functions/pci-epf-test.c uses
> dma_map_single() when using DMA, and signals completion using MSI.
>
> On the EP side, when reading/writing to the BARs, it simply does
> READ_ONCE()/WRITE_ONCE():
> https://github.com/torvalds/linux/blob/v6.8/drivers/pci/endpoint/functions/pci-epf-test.c#L643-L648
>
> There is no dma_sync(), so the pci-epf-test driver currently seems to
> depend on the backing memory being allocated by dma_alloc_coherent().

From my reading of that function, this is really some kind
of command buffer that implements individual structured
registers and can be accessed from both sides at the same
time, so it would not actually make sense with the streaming
interface and wc/prefetchable access in place of explicit
READ_ONCE/WRITE_ONCE and readl/writel accesses.

>> If you don't care about ordering on that level, I would use
>> dma_map_sg() on the endpoint side and prefetchable mapping on
>> the host side, with the endpoint using dma_sync_*() to pass
>> buffer ownership between the two sides, as controlled by some
>> other communication method (non-prefetchable BAR, MSI, ...).
>
> I don't think there is any big reason why pci-epf-test is
> implemented using dma_alloc_coherent() rather than dma_sync()
> for the memory backing the BARs, but that is the way it is.
>
> Since I don't feel like totally rewriting pci-epf-test, and since
> you say that we shouldn't use dma_alloc_coherent() for the memory
> backing the BARs together with exporting the BAR as prefetchable,
> I will drop this patch from the series in the next revision.

Ok. It might still be useful to extend the driver to also
allow transferring streaming data through a BAR on the
endpoint side. From what I can tell, it currently supports
using either slave DMA or an RC side buffer that is ioremapped
into the endpoint, but that uses a regular ioremap() as well.
Mapping the RC side buffer as WC should make it possible to
transfer data from EP to RC more efficiently, but for the RC
to EP transfers you really want the buffer to be allocated on
the EP, so you can ioremap_wc() it to the RC for a memcpy_toio,
or cacheable read from the EP.

Arnd
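[For the RC-to-EP direction Arnd describes, the RC-side mapping and copy
could look like this sketch; the helper and buffer sizes are hypothetical,
the kernel APIs are real ones.]

#include <linux/io.h>
#include <linux/pci.h>

/*
 * Sketch: the RC maps the EP's prefetchable BAR write-combining and
 * streams a buffer into it; the WC mapping lets the CPU merge writes
 * into larger PCIe transactions than a strictly ordered ioremap() would.
 */
static int rc_stream_to_ep(struct pci_dev *pdev, int bar,
			   const void *src, size_t len)
{
	void __iomem *dst;

	dst = pci_ioremap_wc_bar(pdev, bar);
	if (!dst)
		return -ENOMEM;

	memcpy_toio(dst, src, len);
	/* Make sure the writes are posted before signalling the EP */
	wmb();

	iounmap(dst);
	return 0;
}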
On Mon, Mar 18, 2024 at 07:44:21AM +0100, Arnd Bergmann wrote:
> On Mon, Mar 18, 2024, at 05:30, Manivannan Sadhasivam wrote:
> > On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> >> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> >> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >
> > But I'm not sure I got the answer I was looking for. So let me rephrase my
> > question a bit.
> >
> > For BAR memory, the PCIe spec states that,
> >
> > 'A PCI Express Function requesting Memory Space through a BAR must set the
> > BAR's Prefetchable bit unless the range contains locations with read side
> > effects or locations in which the Function does not tolerate write merging'
> >
> > So here, the spec refers to the backing memory allocated on the endpoint
> > side as the 'range', i.e., the BAR memory allocated on the host that gets
> > mapped on the endpoint.
> >
> > Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
> > memory for each BAR and map it using the iATU.
> >
> > So I want to know if the memory range allocated in the endpoint through
> > dma_alloc_coherent() satisfies the above two conditions in the PCIe spec on
> > all architectures:
> >
> > 1. No Read side effects
> > 2. Tolerates write merging
> >
> > I believe the reason why we are allocating the coherent memory on the
> > endpoint first up is that not all PCIe controllers are DMA coherent, as you
> > said above.
>
> As far as I can tell, we never have read side effects for memory
> backed BARs, but the write merging is something that depends on
> how the memory is used:
>
> If you have anything in that memory that relies on ordering,
> you probably want to map it as coherent on the endpoint side,
> and non-prefetchable on the host controller side, and then
> use the normal rmb()/wmb() barriers on both ends between
> serialized accesses. An example of this would be having blocks
> of data separate from metadata that says whether the data is
> valid.
>
> If you don't care about ordering on that level, I would use
> dma_map_sg() on the endpoint side and prefetchable mapping on
> the host side, with the endpoint using dma_sync_*() to pass
> buffer ownership between the two sides, as controlled by some
> other communication method (non-prefetchable BAR, MSI, ...).
>

Right now, only the test driver and a couple of NTB drivers make use of
the pci_epf_alloc_space() API and they do not need streaming DMA. So to
conclude, we should just live with the coherent allocation/non-prefetch
for now and extend it to streaming DMA/prefetch once we have a function
driver that needs it.

Thanks a lot for your inputs!

- Mani
On Mon, Mar 18, 2024 at 04:49:07PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 18, 2024, at 16:13, Niklas Cassel wrote:
> > On Mon, Mar 18, 2024 at 08:25:36AM +0100, Arnd Bergmann wrote:
> >
> > I personally just care about pci-epf-test, but obviously I don't
> > want to regress any other user of pci_epf_alloc_space().
> >
> > Looking at the endpoint side driver:
> > drivers/pci/endpoint/functions/pci-epf-test.c
> > and the host side driver:
> > drivers/misc/pci_endpoint_test.c
> >
> > On the RC side, allocating buffers that the EP will DMA to is
> > done using: kzalloc() + dma_map_single().
> >
> > On the EP side, drivers/pci/endpoint/functions/pci-epf-test.c uses
> > dma_map_single() when using DMA, and signals completion using MSI.
> >
> > On the EP side, when reading/writing to the BARs, it simply does
> > READ_ONCE()/WRITE_ONCE():
> > https://github.com/torvalds/linux/blob/v6.8/drivers/pci/endpoint/functions/pci-epf-test.c#L643-L648
> >
> > There is no dma_sync(), so the pci-epf-test driver currently seems to
> > depend on the backing memory being allocated by dma_alloc_coherent().
>
> From my reading of that function, this is really some kind
> of command buffer that implements individual structured
> registers and can be accessed from both sides at the same
> time, so it would not actually make sense with the streaming
> interface and wc/prefetchable access in place of explicit
> READ_ONCE/WRITE_ONCE and readl/writel accesses.
>

Right. We should stick to the current implementation for now until a
function driver with a streaming DMA use case comes in.

- Mani

> >> If you don't care about ordering on that level, I would use
> >> dma_map_sg() on the endpoint side and prefetchable mapping on
> >> the host side, with the endpoint using dma_sync_*() to pass
> >> buffer ownership between the two sides, as controlled by some
> >> other communication method (non-prefetchable BAR, MSI, ...).
> >
> > I don't think there is any big reason why pci-epf-test is
> > implemented using dma_alloc_coherent() rather than dma_sync()
> > for the memory backing the BARs, but that is the way it is.
> >
> > Since I don't feel like totally rewriting pci-epf-test, and since
> > you say that we shouldn't use dma_alloc_coherent() for the memory
> > backing the BARs together with exporting the BAR as prefetchable,
> > I will drop this patch from the series in the next revision.
>
> Ok. It might still be useful to extend the driver to also
> allow transferring streaming data through a BAR on the
> endpoint side. From what I can tell, it currently supports
> using either slave DMA or an RC side buffer that is ioremapped
> into the endpoint, but that uses a regular ioremap() as well.
> Mapping the RC side buffer as WC should make it possible to
> transfer data from EP to RC more efficiently, but for the RC
> to EP transfers you really want the buffer to be allocated on
> the EP, so you can ioremap_wc() it to the RC for a memcpy_toio,
> or cacheable read from the EP.
>
> Arnd
diff --git a/drivers/pci/endpoint/pci-epf-core.c b/drivers/pci/endpoint/pci-epf-core.c
index e7dbbeb1f0de..20d2bde0747c 100644
--- a/drivers/pci/endpoint/pci-epf-core.c
+++ b/drivers/pci/endpoint/pci-epf-core.c
@@ -309,6 +309,9 @@ void *pci_epf_alloc_space(struct pci_epf *epf, size_t size, enum pci_barno bar,
 	else
 		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_TYPE_32;
 
+	if (epf_bar[bar].flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
+		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+
 	return space;
 }
 EXPORT_SYMBOL_GPL(pci_epf_alloc_space);
From the PCIe 6.0 base spec:

"Generally only 64-bit BARs are good candidates, since only Legacy
Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
and most scalable platforms map all 32-bit Memory BARs into
non-prefetchable Memory Space regardless of the Prefetchable bit value."

"For a PCI Express Endpoint, 64-bit addressing must be supported for all
BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
for all BARs that do not have the Prefetchable bit Set."

"Any device that has a range that behaves like normal memory should mark
the range as prefetchable. A linear frame buffer in a graphics device is
an example of a range that should be marked prefetchable."

The PCIe spec tells us that we should have the prefetchable bit set for
64-bit BARs backed by "normal memory". The backing memory that we allocate
for a 64-bit BAR using pci_epf_alloc_space() (which calls
dma_alloc_coherent()) is obviously "normal memory".

Thus, set the prefetchable bit when allocating backing memory for a 64-bit
BAR.

Signed-off-by: Niklas Cassel <cassel@kernel.org>
---
 drivers/pci/endpoint/pci-epf-core.c | 3 +++
 1 file changed, 3 insertions(+)