Message ID: 20241012113246.95634-1-dlemoal@kernel.org (mailing list archive)
Series: Improve PCI memory mapping API
On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > This series introduces the new functions pci_epc_mem_map() and > pci_epc_mem_unmap() to improve handling of the PCI address mapping > alignment constraints of endpoint controllers in a controller > independent manner. > > The issue fixed is that the fixed alignment defined by the "align" field > of struct pci_epc_features is defined for inbound ATU entries (e.g. > BARs) and is a fixed value, whereas some controllers need a PCI address > alignment that depends on the PCI address itself. For instance, the > rk3399 SoC controller in endpoint mode uses the lower bits of the local > endpoint memory address as the lower bits for the PCI addresses for data > transfers. That is, when mapping local memory, one must take into > account the number of bits of the RC PCI address that change from the > start address of the mapping. > > To fix this, the new endpoint controller method .align_addr is > introduced and called from the new pci_epc_mem_map() function. This > method is optional and for controllers that do not define it, it is > assumed that the controller has no PCI address constraint. > > The functions pci_epc_mem_map() is a helper function which obtains > the mapping information, allocates endpoint controller memory according > to the mapping size obtained and maps the memory. pci_epc_mem_unmap() > unmaps and frees the endpoint memory. > > This series is organized as follows: > - Patch 1 introduces a small helper to clean up the epc core code > - Patch 2 improves pci_epc_mem_alloc_addr() > - Patch 3 introduce the new align_addr() endpoint controller method > and the epc functions pci_epc_mem_map() and pci_epc_mem_unmap(). > - Patch 4 documents these new functions. > - Patch 5 modifies the test endpoint function driver to use > pci_epc_mem_map() and pci_epc_mem_unmap() to illustrate the use of > these functions. > - Finally, patch 6 implements the RK3588 endpoint controller driver > .align_addr() operation to satisfy that controller PCI address > alignment constraint. > > This patch series was tested using the pci endpoint test driver (as-is > and a modified version removing memory allocation alignment on the host > side) as well as with a prototype NVMe endpoint function driver (where > data transfers use addresses that are never aligned to any specific > boundary). > Applied to pci/endpoint! - Mani > Changes from v5: > - Changed patch 3 to rename the new controller operation to align_addr > and change its interface. Patch 6 is changed accordingly. > > Changes from v4: > - Removed the patch adding the pci_epc_map_align() function. The former > .map_align controller operation is now introduced in patch 3 as > "get_mem_map()" and used directly in the new pci_epf_mem_map() > function. > - Modified the documentation patch 4 to reflect the previous change. > - Changed patch 6 title and modified it to rename map_align to > get_mem_map in accordance with the new patch 3. > - Added review tags > > Changes from v3: > - Addressed Niklas comments (improved patch 2 commit message, added > comments in the pci_epc_map_align() function in patch 3, typos and > improvements in patch 5, patch 7 commit message). > - Added review tags > > Changes from v2: > - Dropped all patches for the rockchip-ep. These patches will be sent > later as a separate series. > - Added patch 2 and 5 > - Added review tags to patch 1 > > Changes from v1: > - Changed pci_epc_check_func() to pci_epc_function_is_valid() in patch > 1. 
> - Removed patch "PCI: endpoint: Improve pci_epc_mem_alloc_addr()" > (former patch 2 of v1) > - Various typos cleanups all over. Also fixed some blank space > indentation. > - Added review tags > > Damien Le Moal (6): > PCI: endpoint: Introduce pci_epc_function_is_valid() > PCI: endpoint: Improve pci_epc_mem_alloc_addr() > PCI: endpoint: Introduce pci_epc_mem_map()/unmap() > PCI: endpoint: Update documentation > PCI: endpoint: test: Use pci_epc_mem_map/unmap() > PCI: dwc: endpoint: Implement the pci_epc_ops::align_addr() operation > > Documentation/PCI/endpoint/pci-endpoint.rst | 29 ++ > .../pci/controller/dwc/pcie-designware-ep.c | 15 + > drivers/pci/endpoint/functions/pci-epf-test.c | 372 +++++++++--------- > drivers/pci/endpoint/pci-epc-core.c | 182 ++++++--- > drivers/pci/endpoint/pci-epc-mem.c | 9 +- > include/linux/pci-epc.h | 38 ++ > 6 files changed, 415 insertions(+), 230 deletions(-) > > -- > 2.47.0 >
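Concretely, the interface the cover letter describes looks roughly like this, as I read the patches; the field names and comments below are my interpretation and may differ slightly from the final code:

struct pci_epc_map {
        u64             pci_addr;      /* PCI address requested by the caller */
        size_t          pci_size;      /* portion of that range covered by this mapping */
        u64             map_pci_addr;  /* aligned PCI address actually programmed */
        size_t          map_size;      /* size of the outbound ATU mapping */
        phys_addr_t     phys_base;     /* EPC memory backing the mapping */
        phys_addr_t     phys_addr;     /* EPC address corresponding to pci_addr */
        void __iomem    *virt_base;
        void __iomem    *virt_addr;    /* CPU address to use with memcpy_toio/fromio() */
};

int pci_epc_mem_map(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
                    u64 pci_addr, size_t pci_size, struct pci_epc_map *map);
void pci_epc_mem_unmap(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
                       struct pci_epc_map *map);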
On 10/12/24 20:57, Manivannan Sadhasivam wrote: > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: >> This series introduces the new functions pci_epc_mem_map() and >> pci_epc_mem_unmap() to improve handling of the PCI address mapping >> alignment constraints of endpoint controllers in a controller >> independent manner. >> >> The issue fixed is that the fixed alignment defined by the "align" field >> of struct pci_epc_features is defined for inbound ATU entries (e.g. >> BARs) and is a fixed value, whereas some controllers need a PCI address >> alignment that depends on the PCI address itself. For instance, the >> rk3399 SoC controller in endpoint mode uses the lower bits of the local >> endpoint memory address as the lower bits for the PCI addresses for data >> transfers. That is, when mapping local memory, one must take into >> account the number of bits of the RC PCI address that change from the >> start address of the mapping. >> >> To fix this, the new endpoint controller method .align_addr is >> introduced and called from the new pci_epc_mem_map() function. This >> method is optional and for controllers that do not define it, it is >> assumed that the controller has no PCI address constraint. >> >> The functions pci_epc_mem_map() is a helper function which obtains >> the mapping information, allocates endpoint controller memory according >> to the mapping size obtained and maps the memory. pci_epc_mem_unmap() >> unmaps and frees the endpoint memory. >> >> This series is organized as follows: >> - Patch 1 introduces a small helper to clean up the epc core code >> - Patch 2 improves pci_epc_mem_alloc_addr() >> - Patch 3 introduce the new align_addr() endpoint controller method >> and the epc functions pci_epc_mem_map() and pci_epc_mem_unmap(). >> - Patch 4 documents these new functions. >> - Patch 5 modifies the test endpoint function driver to use >> pci_epc_mem_map() and pci_epc_mem_unmap() to illustrate the use of >> these functions. >> - Finally, patch 6 implements the RK3588 endpoint controller driver >> .align_addr() operation to satisfy that controller PCI address >> alignment constraint. >> >> This patch series was tested using the pci endpoint test driver (as-is >> and a modified version removing memory allocation alignment on the host >> side) as well as with a prototype NVMe endpoint function driver (where >> data transfers use addresses that are never aligned to any specific >> boundary). >> > > Applied to pci/endpoint! Awesome ! Thanks a lot !
On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > This series introduces the new functions pci_epc_mem_map() and > pci_epc_mem_unmap() to improve handling of the PCI address mapping > alignment constraints of endpoint controllers in a controller > independent manner. Hi Damien, I know this is obvious to everybody except me, but who uses this mapping? A driver running on the endpoint that does MMIO? DMA (Memory Reads or Writes that target an endpoint BAR)? I'm still trying to wrap my head around the whole endpoint driver model. Bjorn
On 10/22/24 07:19, Bjorn Helgaas wrote: > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: >> This series introduces the new functions pci_epc_mem_map() and >> pci_epc_mem_unmap() to improve handling of the PCI address mapping >> alignment constraints of endpoint controllers in a controller >> independent manner. > > Hi Damien, I know this is obvious to everybody except me, but who uses > this mapping? A driver running on the endpoint that does MMIO? DMA > (Memory Reads or Writes that target an endpoint BAR)? I'm still > trying to wrap my head around the whole endpoint driver model. The mapping API is for mmio or DMA using endpoint controller memory mapped to a host PCI address range. It is not for BARs. BARs setup does not use the same API and has not changed with these patches. BARs can still be accessed on the endpoint (within the EPF driver) with regular READ_ONCE()/WRITE_ONCE() once they are set. But any data transfer between the PCI RC host and the EPF driver on the EP host that uses mmio or the generic DMA channel (memory copy offload channel) needs the new mapping API. DMA transfers that can be done using dedicated DMA rx/tx channels associated with the endpoint controller do not need to use this API as the mapping to the host PCI address space is automatically handled by the DMA driver.
On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > On 10/22/24 07:19, Bjorn Helgaas wrote: > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > > DMA transfers that can be done using dedicated DMA rx/tx channels associated > with the endpoint controller do not need to use this API as the mapping to > the host PCI address space is automatically handled by the DMA driver. I disagree with this part. I think that it depends on the DMA controller. Looking at the manual for e.g. the embedded DMA controller on the DWC controller (eDMA): "" When you do not want the iATU to translate outbound requests that are generated by the internal DMA module, then you must implement one of the following approaches: - Ensure that the combination of DMA channel programming and iATU control register programming, causes no translation of DMA traffic to be done in the iATU. or - Activate the DMA bypass mode to allow request TLPs which are initiated by the DMA controller to pass through the iATU untranslated. You can activate DMA bypass mode by setting the DMA Bypass field of the iATU Control 2 Register (IATU_REGION_CTRL_OFF_2_OUTBOUND_0). "" However, we also know that, if there is no match in the iATU table: "" The default behavior of the ATU when there is no address match in the outbound direction or no TLP attribute match in the inbound direction, is to pass the transaction through. "" I.e. if there is an iATU mapping (which is what pci_epc_map_addr() sets up), the eDMA will take that into account. If there is none, it will go through untranslated. So for the eDMA embedded on the DWC controller, we do not strictly need to call pci_epc_map_addr(). (However, we probably want to do it anyway, as long as we haven't enabled DMA bypass mode, just to make sure that there is no other iATU mapping in the mapping table that will affect our transfer.) However, if you think about a generic DMA controller, e.g. an ARM primecell pl330, I don't see how that DMA controller will be able to perform transfers correctly if there is not an iATU mapping for the region that it is reading/writing to. So the safest thing, that will work with any DMA controller, is probably to always call pci_epc_map_addr() before doing the transfer. However, as I pointed out in: https://lore.kernel.org/lkml/Zg5oeDzq5u3jmKIu@ryzen/ This behavior is still inconsistent between PCI EPF drivers: E.g. for pci-epf-test.c: When using dedicated DMA rx/tx channels (epf_test->dma_private == true), and when FLAG_USE_DMA is set, it currently always calls pci_epc_map_addr() before doing the transfer. However for pci-epf-mhi.c: When using dedicated DMA rx/tx channels and when MHI_EPF_USE_DMA is set, it currently does not call pci_epc_map_addr() before doing the transfer. Kind regards, Niklas
On 10/22/24 17:38, Niklas Cassel wrote: > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: >> On 10/22/24 07:19, Bjorn Helgaas wrote: >>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: >> >> DMA transfers that can be done using dedicated DMA rx/tx channels associated >> with the endpoint controller do not need to use this API as the mapping to >> the host PCI address space is automatically handled by the DMA driver. > > I disagree with this part. It think that it depends on the DMA controller. > > Looking at the manual for e.g. the embedded DMA controller on the DWC > controller (eDMA): > "" > When you do not want the iATU to translate outbound requests that are generated by the > internal DMA module, then you must implement one of the following approaches: > - Ensure that the combination of DMA channel programming and iATU control register > programming, causes no translation of DMA traffic to be done in the iATU. > or > - Activate the DMA bypass mode to allow request TLPs which are initiated by the DMA > controller to pass through the iATU untranslated. You can activate DMA bypass mode by > setting the DMA Bypass field of the iATU Control 2 Register (IATU_REGION_C- > TRL_OFF_2_OUTBOUND_0). > "" > > However, we also know that, if there is no match in the iATU table: > "" > The default behavior of the ATU when there is no address match in the outbound direction or no > TLP attribute match in the inbound direction, is to pass the transaction through. > "" > > I.e. if there is a iATU mapping (which is what pci_epc_map_addr() sets up), > the eDMA will take that into account. If there is none, it will go through > untranslated. > > > So for the eDMA embedded on the DWC controller, we do not strictly need to > call pci_epc_map_addr(). (However, we probably want to do it anyway, as long > as we haven't enabled DMA bypass mode, just to make sure that there is no > other iATU mapping in the mapping table that will affect our transfer.) > > However, if you think about a generic DMA controller, e.g. an ARM primecell > pl330, I don't see how it that DMA controller will be able to perform > transfers correctly if there is not an iATU mapping for the region that it > is reading/writing to. > > So the safest thing, that will work with any DMA controller is probably to > always call pci_epc_map_addr() before doing the transfer. Indeed, I think you are right. Not doing the mapping before DMA with the RK3588 works fine, but that may not be the case for all controllers.
On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote: > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > > On 10/22/24 07:19, Bjorn Helgaas wrote: > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > > > > DMA transfers that can be done using dedicated DMA rx/tx channels associated > > with the endpoint controller do not need to use this API as the mapping to > > the host PCI address space is automatically handled by the DMA driver. > > I disagree with this part. It think that it depends on the DMA controller. > > Looking at the manual for e.g. the embedded DMA controller on the DWC > controller (eDMA): > "" > When you do not want the iATU to translate outbound requests that are generated by the > internal DMA module, then you must implement one of the following approaches: > - Ensure that the combination of DMA channel programming and iATU control register > programming, causes no translation of DMA traffic to be done in the iATU. > or > - Activate the DMA bypass mode to allow request TLPs which are initiated by the DMA > controller to pass through the iATU untranslated. You can activate DMA bypass mode by > setting the DMA Bypass field of the iATU Control 2 Register (IATU_REGION_C- > TRL_OFF_2_OUTBOUND_0). > "" > > However, we also know that, if there is no match in the iATU table: > "" > The default behavior of the ATU when there is no address match in the outbound direction or no > TLP attribute match in the inbound direction, is to pass the transaction through. > "" > > I.e. if there is a iATU mapping (which is what pci_epc_map_addr() sets up), > the eDMA will take that into account. If there is none, it will go through > untranslated. > > > So for the eDMA embedded on the DWC controller, we do not strictly need to > call pci_epc_map_addr(). (However, we probably want to do it anyway, as long > as we haven't enabled DMA bypass mode, just to make sure that there is no > other iATU mapping in the mapping table that will affect our transfer.) > I do not recommend that. EPF driver *should* know how to isolate the address spaces between DMA and iATU. Creating the iATU mapping for the DMA address is just defeating the purpose of using DMA. If any overlap mapping is present, then the EPF driver has a bug! > However, if you think about a generic DMA controller, e.g. an ARM primecell > pl330, I don't see how it that DMA controller will be able to perform > transfers correctly if there is not an iATU mapping for the region that it > is reading/writing to. > I don't think the generic DMA controller can be used to read/write to remote memory. It needs to be integrated with the PCIe IP so that it can issue PCIe transactions. > So the safest thing, that will work with any DMA controller is probably to > always call pci_epc_map_addr() before doing the transfer. > > > However, as I pointed out in: > https://lore.kernel.org/lkml/Zg5oeDzq5u3jmKIu@ryzen/ > > This behavior is still inconsistent between PCI EPF drivers: > E.g. for pci-epf-test.c: > When using dedicated DMA rx/tx channels (epf_test->dma_private == true), > and when FLAG_USE_DMA is set, it currently always calls pci_epc_map_addr() > before doing the transfer. > > However for pci-epf-mhi.c: > When using dedicated DMA rx/tx channels and when MHI_EPF_USE_DMA is set, > it currently does not call pci_epc_map_addr() before doing the transfer. > Because, there are not going to be any overlapping mappings. But I agree with the inconsistency between EPF drivers though... - Mani
On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote: > On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote: > > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > > > On 10/22/24 07:19, Bjorn Helgaas wrote: > > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > > > However, if you think about a generic DMA controller, e.g. an ARM primecell > > pl330, I don't see how it that DMA controller will be able to perform > > transfers correctly if there is not an iATU mapping for the region that it > > is reading/writing to. > > > > I don't think the generic DMA controller can be used to read/write to remote > memory. It needs to be integrated with the PCIe IP so that it can issue PCIe > transactions. I'm not an expert, so I might of course be misunderstanding how things work. When the CPU performs an AXI read/write to a MMIO address within the PCIe controller (specifically the PCIe controller's outbound memory window), the PCIe controller translates that incoming read/write to a read/write on the PCIe bus. (The PCI address of the generated PCIe transaction will depend on how the iATU has been configured, which determines how reads/writes to the PCIe controller's outbound memory window should be translated to PCIe addresses.) If that is how it works when the CPU does the AXI read/write, why wouldn't things work the same if it is an generic DMA controller performing the AXI read/write to the MMIO address targeting the PCIe controller's outbound memory window? Kind regards, Niklas
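To make the outbound-window translation discussed here concrete, a purely illustrative sketch of the address arithmetic an outbound iATU entry performs; the structure and function below are made up for illustration and are not taken from any driver:

/* Illustrative only: how an outbound iATU entry translates an AXI access. */
struct iatu_ob_entry {
        u64 cpu_base;   /* start of the outbound window (CPU/AXI address) */
        u64 pci_base;   /* PCI address programmed for this entry */
        u64 size;       /* window size */
};

/* Any master (the CPU or a generic DMA controller) hitting the window... */
static u64 iatu_translate(const struct iatu_ob_entry *e, u64 axi_addr)
{
        if (axi_addr < e->cpu_base || axi_addr >= e->cpu_base + e->size)
                return axi_addr;        /* no match: passed through untranslated */

        /* ...has its address rewritten to the mapped PCI address. */
        return e->pci_base + (axi_addr - e->cpu_base);
}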
On Tue, Oct 22, 2024 at 04:16:24PM +0200, Niklas Cassel wrote: > On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote: > > On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote: > > > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > > > > On 10/22/24 07:19, Bjorn Helgaas wrote: > > > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > > > > > However, if you think about a generic DMA controller, e.g. an ARM primecell > > > pl330, I don't see how it that DMA controller will be able to perform > > > transfers correctly if there is not an iATU mapping for the region that it > > > is reading/writing to. > > > > > > > I don't think the generic DMA controller can be used to read/write to remote > > memory. It needs to be integrated with the PCIe IP so that it can issue PCIe > > transactions. > > I'm not an expert, so I might of course be misunderstanding how things work. > > When the CPU performs an AXI read/write to a MMIO address within the PCIe > controller (specifically the PCIe controller's outbound memory window), > the PCIe controller translates that incoming read/write to a read/write on the > PCIe bus. > > (The PCI address of the generated PCIe transaction will depend on how the iATU > has been configured, which determines how reads/writes to the PCIe controller's > outbound memory window should be translated to PCIe addresses.) > > If that is how it works when the CPU does the AXI read/write, why wouldn't > things work the same if it is an generic DMA controller performing the AXI > read/write to the MMIO address targeting the PCIe controller's outbound memory > window? A generic DMA controller can perform memory-to-memory transfers (to the PCI mapped window) to do the data transfer, but MMIO needs to be mapped by the iATU of the PCI controller. For example, to copy 0x4000_0000 to the PCI host's 0xA_0000_0000: EP memory (0x4000_0000) -> DMA -> PCI mapped window (iATU) (0x8000_0000) -> PCI bus (0xA_0000_0000) -> Host memory (0xA_0000_0000). But the embedded DMA can bypass the iATU and directly send out TLPs: EP memory (0x4000_0000) -> PCI controller DMA -> PCI bus (0xA_0000_0000) -> Host memory (0xA_0000_0000). In other words, the eDMA can copy data to any place in PCI host memory, while a generic DMA controller can only copy data to the EP PCI mapped window. Frank > > > Kind regards, > Niklas
On Tue, Oct 22, 2024 at 04:16:24PM +0200, Niklas Cassel wrote: > On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote: > > On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote: > > > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > > > > On 10/22/24 07:19, Bjorn Helgaas wrote: > > > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > > > > > However, if you think about a generic DMA controller, e.g. an ARM primecell > > > pl330, I don't see how it that DMA controller will be able to perform > > > transfers correctly if there is not an iATU mapping for the region that it > > > is reading/writing to. > > > > > > > I don't think the generic DMA controller can be used to read/write to remote > > memory. It needs to be integrated with the PCIe IP so that it can issue PCIe > > transactions. > > I'm not an expert, so I might of course be misunderstanding how things work. > Neither am I :) I'm just sharing my understanding based on reading the DWC spec and open to get corrected if I'm wrong. > When the CPU performs an AXI read/write to a MMIO address within the PCIe > controller (specifically the PCIe controller's outbound memory window), > the PCIe controller translates that incoming read/write to a read/write on the > PCIe bus. > I don't think the *PCIe controller* translates the read/writes, but the iATU. If we use iATU, then the remote address needs to be mapped to the endpoint DDR and if CPU performs AXI read/write to that address, then iATU will translate the DDR address to remote address and then issue PCIe transactions (together with the PCIe controller). And if DMA is used, then DMA controller can issue PCIe transactions to the remote memory itself (again, together with the PCIe controller). So no mapping is required here. - Mani
On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > On 10/22/24 07:19, Bjorn Helgaas wrote: > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > >> This series introduces the new functions pci_epc_mem_map() and > >> pci_epc_mem_unmap() to improve handling of the PCI address mapping > >> alignment constraints of endpoint controllers in a controller > >> independent manner. > > > > Hi Damien, I know this is obvious to everybody except me, but who > > uses this mapping? A driver running on the endpoint that does > > MMIO? DMA (Memory Reads or Writes that target an endpoint BAR)? > > I'm still trying to wrap my head around the whole endpoint driver > > model. > > The mapping API is for mmio or DMA using endpoint controller memory > mapped to a host PCI address range. It is not for BARs. BARs setup > does not use the same API and has not changed with these patches. > > BARs can still be accessed on the endpoint (within the EPF driver) > with regular READ_ONCE()/WRITE_ONCE() once they are set. But any > data transfer between the PCI RC host and the EPF driver on the EP > host that use mmio or DMA generic channel (memory copy offload > channel) needs the new mapping API. DMA transfers that can be done > using dedicated DMA rx/tx channels associated with the endpoint > controller do not need to use this API as the mapping to the host > PCI address space is automatically handled by the DMA driver. Sorry I'm dense. I'm really not used to thinking in the endpoint point of view. Correct me where I go off the rails: - This code (pci_epc_mem_map()) runs on an endpoint, basically as part of some endpoint firmware, right? - On the endpoint's PCIe side, it receives memory read/write TLPs? - These TLPs would be generated elsewhere in the PCIe fabric, e.g., by a driver on the host doing MMIO to the endpoint, or possibly another endpoint doing peer-to-peer DMA. - Mem read/write TLPs are routed by address, and the endpoint accepts them if the address matches one of its BARs. - This is a little different from a Root Complex, which would basically treat reads/writes to anything outside the Root Port windows as incoming DMA headed to physical memory. - A Root Complex would use the TLP address (the PCI bus address) directly as a CPU physical memory address unless the TLP address is translated by an IOMMU. - For the endpoint, you want the BAR to be an aperture to physical memory in the address space of the SoC driving the endpoint. - The SoC physical memory address may be different from the PCI but address in the TLP, and pci_epc_mem_map() is the way to account for this? - IOMMU translations from PCI to CPU physical address space are pretty arbitrary and needn't be contiguous on the CPU side. - pci_epc_mem_map() sets up a conceptually similar PCI to CPU address space translation, but it's much simpler because it basically applies just a constant offset?
On 10/23/24 05:47, Bjorn Helgaas wrote: > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: >> On 10/22/24 07:19, Bjorn Helgaas wrote: >>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: >>>> This series introduces the new functions pci_epc_mem_map() and >>>> pci_epc_mem_unmap() to improve handling of the PCI address mapping >>>> alignment constraints of endpoint controllers in a controller >>>> independent manner. >>> >>> Hi Damien, I know this is obvious to everybody except me, but who >>> uses this mapping? A driver running on the endpoint that does >>> MMIO? DMA (Memory Reads or Writes that target an endpoint BAR)? >>> I'm still trying to wrap my head around the whole endpoint driver >>> model. >> >> The mapping API is for mmio or DMA using endpoint controller memory >> mapped to a host PCI address range. It is not for BARs. BARs setup >> does not use the same API and has not changed with these patches. >> >> BARs can still be accessed on the endpoint (within the EPF driver) >> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any >> data transfer between the PCI RC host and the EPF driver on the EP >> host that use mmio or DMA generic channel (memory copy offload >> channel) needs the new mapping API. DMA transfers that can be done >> using dedicated DMA rx/tx channels associated with the endpoint >> controller do not need to use this API as the mapping to the host >> PCI address space is automatically handled by the DMA driver. > > Sorry I'm dense. I'm really not used to thinking in the endpoint > point of view. Correct me where I go off the rails: > > - This code (pci_epc_mem_map()) runs on an endpoint, basically as > part of some endpoint firmware, right? Not sure what you mean by "firmware" here. pci_epc_mem_map() is intended for use by a PCI endpoint function driver on the EP side, nothing else. > - On the endpoint's PCIe side, it receives memory read/write TLPs? pci_epc_mem_map() does not receive anything itself. It only allocates local EP controller memory from outbound ATU windows and map that to a PCI address region of the host (RC side). Here "map" means programming the ATU using a correctly aligned PCI address that satisfies the EP PCI controller alignment constraints. The memory read/write TLPs will happen once the EP function driver access the mapped memory with memcpy_fromio/toio() (or use generic memory copy offload DMA channel). > - These TLPs would be generated elsewhere in the PCIe fabric, e.g., > by a driver on the host doing MMIO to the endpoint, or possibly > another endpoint doing peer-to-peer DMA. MMIO or DMA, yes. > > - Mem read/write TLPs are routed by address, and the endpoint > accepts them if the address matches one of its BARs. Yes, but pci_epc_mem_map() is not for use for BARs on the EP side. BARs setup on EP PCI controllers use inbound ATU windows. pci_epc_mem_map() programs outbound ATU windows. BARs setup use pci_epf_alloc_space() + pci_epc_set_bar() to setup a BAR. > > - This is a little different from a Root Complex, which would > basically treat reads/writes to anything outside the Root Port > windows as incoming DMA headed to physical memory. > > - A Root Complex would use the TLP address (the PCI bus address) > directly as a CPU physical memory address unless the TLP address > is translated by an IOMMU. > > - For the endpoint, you want the BAR to be an aperture to physical > memory in the address space of the SoC driving the endpoint. 
Yes, and as said above this is done with pci_epf_alloc_space() + pci_epc_set_bar(), not pci_epc_mem_map(). > - The SoC physical memory address may be different from the PCI but > address in the TLP, and pci_epc_mem_map() is the way to account > for this? pci_epc_mem_map() is essentially a call to pci_epc_mem_alloc_addr() + pci_epc_map_addr(). pci_epc_mem_alloc_addr() allocates memory from the EP PCI controller outbound windows and pci_epc_map_addr() programs an EP PCI controller outbound ATU entry to map that memory to some PCI address region of the host (RC). pci_epc_mem_map() was created because the current epc API does not hide any of the EP PCI controller PCI address alignment constraints for programming outbound ATU entries. This means that using directly pci_epc_mem_alloc_addr() + pci_epc_map_addr(), things may or may not work (often the latter) depending on the PCI address region (start address and its size) that is to be mapped and accessed by the endpoint function driver. Most EP PCI controllers at least require a PCI address to be aligned to some fixed boundary (e.g. 64K for the RK3588 SoC on the Rock5B board). Even such alignment boundary/mask is not exposed through the API (epc_features->align is for BARs and should not be used for outbound ATU programming). Worse yet, some EP PCI controllers use the lower bits of a local memory window address range as the lower bits of the RC PCI address range when generating TLPs (e.g. RK3399 and Cadence EP controller). For these EP PCI controllers, the programming of outbound ATU entries for mapping a RC PCI address region thus ends up depending on the start PCI address of the region *and* the region size (as the size drives the number of lower bits of address that will change over the entire region). The above API issues were essentially unsolvable from a regular endpoint driver correctly coded to be EP controller independent. I could not get my prototype nvme endpoint function driver to work without introducing pci_epc_mem_map() (nvme does not require any special alignment of command buffers, so on an nvme EPF driver we end up having to deal with essentially random PCI addresses that have no special alignment). pci_epc_mem_map() relies on the new ->align_addr() EP controller operation to get information about the start address and the size to use for mapping to local memory a PCI address region. The size given is used for pci_epc_mem_alloc_addr() and the start address and size given are used with pci_epc_map_addr(). This generates a correct mapping regardless of the PCI address and size given. And for the actual data access by the EP function driver, pci_epc_mem_map() gives the address within the mapping that corresponds to the PCI address that we wanted mapped. Et voila, problem solved :) > > - IOMMU translations from PCI to CPU physical address space are > pretty arbitrary and needn't be contiguous on the CPU side. > > - pci_epc_mem_map() sets up a conceptually similar PCI to CPU > address space translation, but it's much simpler because it > basically applies just a constant offset?
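Put as code, the flow described above looks roughly like the simplified sketch below. This is not the actual kernel implementation: the ->align_addr() signature and the pci_epc_map fields are my approximation, and error handling and locking are trimmed.

int pci_epc_mem_map_sketch(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
                           u64 pci_addr, size_t pci_size,
                           struct pci_epc_map *map)
{
        size_t map_size = pci_size;
        size_t map_offset = 0;
        u64 map_pci_addr = pci_addr;
        int ret;

        /*
         * Let the controller align the PCI address and size according to its
         * constraints (optional operation: no constraint if not implemented).
         */
        if (epc->ops->align_addr)
                map_pci_addr = epc->ops->align_addr(epc, pci_addr,
                                                    &map_size, &map_offset);

        /* Allocate EPC memory from an outbound window for the aligned size. */
        map->virt_base = pci_epc_mem_alloc_addr(epc, &map->phys_base, map_size);
        if (!map->virt_base)
                return -ENOMEM;

        /* Program an outbound ATU entry for the aligned PCI address. */
        ret = pci_epc_map_addr(epc, func_no, vfunc_no, map->phys_base,
                               map_pci_addr, map_size);
        if (ret) {
                pci_epc_mem_free_addr(epc, map->phys_base, map->virt_base,
                                      map_size);
                return ret;
        }

        /* Hand back the addresses that correspond to the requested pci_addr. */
        map->virt_addr = map->virt_base + map_offset;
        map->phys_addr = map->phys_base + map_offset;
        map->pci_addr = pci_addr;
        map->map_size = map_size;
        map->pci_size = min(pci_size, map_size - map_offset);

        return 0;
}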
On 10/23/24 00:30, Manivannan Sadhasivam wrote: > On Tue, Oct 22, 2024 at 04:16:24PM +0200, Niklas Cassel wrote: >> On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote: >>> On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote: >>>> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: >>>>> On 10/22/24 07:19, Bjorn Helgaas wrote: >>>>>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: >>> >>>> However, if you think about a generic DMA controller, e.g. an ARM primecell >>>> pl330, I don't see how it that DMA controller will be able to perform >>>> transfers correctly if there is not an iATU mapping for the region that it >>>> is reading/writing to. >>>> >>> >>> I don't think the generic DMA controller can be used to read/write to remote >>> memory. It needs to be integrated with the PCIe IP so that it can issue PCIe >>> transactions. >> >> I'm not an expert, so I might of course be misunderstanding how things work. >> > > Neither am I :) I'm just sharing my understanding based on reading the DWC spec > and open to get corrected if I'm wrong. > >> When the CPU performs an AXI read/write to a MMIO address within the PCIe >> controller (specifically the PCIe controller's outbound memory window), >> the PCIe controller translates that incoming read/write to a read/write on the >> PCIe bus. >> > > I don't think the *PCIe controller* translates the read/writes, but the iATU. If > we use iATU, then the remote address needs to be mapped to the endpoint DDR and > if CPU performs AXI read/write to that address, then iATU will translate the DDR > address to remote address and then issue PCIe transactions (together with the > PCIe controller). > > And if DMA is used, then DMA controller can issue PCIe transactions to the > remote memory itself (again, together with the PCIe controller). So no mapping > is required here. Caveat: DMA here cannot be the case "using the generic memory copy offload DMA channel". It must be DMA driven by hardware rx/tx DMA channels. For the memory copy offload DMA channel, an EPF must map the address range it wants to "DMA" to (which is really memcpy_toio/fromio(), but the DMA API will hide that...). Would love to have some EPF wrapper API around all that mess to simplify EPF drivers. They all end up having to do the exact same things. > > - Mani >
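As a sketch of what using the memory copy offload channel against a mapped region can look like (loosely modeled on what pci-epf-test does; this assumes the mapping was set up with pci_epc_mem_map() so that map->phys_addr points at EPC outbound memory, and that "src" is an already dma-mapped local buffer):

static void epf_dma_done(void *param)
{
        complete(param);
}

/* Write "len" bytes from a dma-mapped local buffer to the host via a memcpy channel. */
static int epf_memcpy_dma_to_host(struct dma_chan *chan, dma_addr_t src,
                                  struct pci_epc_map *map, size_t len)
{
        struct dma_async_tx_descriptor *tx;
        DECLARE_COMPLETION_ONSTACK(done);
        dma_cookie_t cookie;

        /* The destination is the EPC outbound memory backing the host PCI address. */
        tx = dmaengine_prep_dma_memcpy(chan, map->phys_addr, src, len,
                                       DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
        if (!tx)
                return -EIO;

        tx->callback = epf_dma_done;
        tx->callback_param = &done;

        cookie = dmaengine_submit(tx);
        if (dma_submit_error(cookie))
                return -EIO;

        dma_async_issue_pending(chan);
        wait_for_completion(&done);

        return 0;
}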
On Wed, Oct 23, 2024 at 07:05:54AM +0900, Damien Le Moal wrote: > On 10/23/24 05:47, Bjorn Helgaas wrote: > > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: > >> On 10/22/24 07:19, Bjorn Helgaas wrote: > >>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: > >>>> This series introduces the new functions pci_epc_mem_map() and > >>>> pci_epc_mem_unmap() to improve handling of the PCI address mapping > >>>> alignment constraints of endpoint controllers in a controller > >>>> independent manner. > >>> > >>> Hi Damien, I know this is obvious to everybody except me, but who > >>> uses this mapping? A driver running on the endpoint that does > >>> MMIO? DMA (Memory Reads or Writes that target an endpoint BAR)? > >>> I'm still trying to wrap my head around the whole endpoint driver > >>> model. > >> > >> The mapping API is for mmio or DMA using endpoint controller memory > >> mapped to a host PCI address range. It is not for BARs. BARs setup > >> does not use the same API and has not changed with these patches. > >> > >> BARs can still be accessed on the endpoint (within the EPF driver) > >> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any > >> data transfer between the PCI RC host and the EPF driver on the EP > >> host that use mmio or DMA generic channel (memory copy offload > >> channel) needs the new mapping API. DMA transfers that can be done > >> using dedicated DMA rx/tx channels associated with the endpoint > >> controller do not need to use this API as the mapping to the host > >> PCI address space is automatically handled by the DMA driver. > > > > Sorry I'm dense. I'm really not used to thinking in the endpoint > > point of view. Correct me where I go off the rails: > > > > - This code (pci_epc_mem_map()) runs on an endpoint, basically as > > part of some endpoint firmware, right? > > Not sure what you mean by "firmware" here. pci_epc_mem_map() is > intended for use by a PCI endpoint function driver on the EP side, > nothing else. I mean the software operating the endpoint from the "inside", not from the PCIe side. In a non-endpoint framework scenario, the kernel running on the host enumerates PCI devices using config reads, and a driver like nvme runs on the host and operates an NVMe device with config accesses and MMIO to the NVMe BARs. I referred to "firmware" because if the endpoint were on a PCIe plug-in card, an SoC on the card could run firmware that uses the endpoint framework to respond to PCIe activity generated by the host. (Of course, if the host driver programs the endpoint appropriately, the endpoint might also initiate its own PCIe activity for DMA.) > > - On the endpoint's PCIe side, it receives memory read/write TLPs? > > pci_epc_mem_map() does not receive anything itself. It only > allocates local EP controller memory from outbound ATU windows and > map that to a PCI address region of the host (RC side). Here "map" > means programming the ATU using a correctly aligned PCI address that > satisfies the EP PCI controller alignment constraints. I assumed we were talking about an endpoint responding to PCIe traffic generated elsewhere, but I think my assumption was wrong. I don't have any specs for these endpoint controllers, and there's nothing in Documentation/PCI/endpoint/ about ATUs. Some doc and a few pictures would go a long ways. > The memory read/write TLPs will happen once the EP function driver > access the mapped memory with memcpy_fromio/toio() (or use generic > memory copy offload DMA channel). 
This would be the endpoint itself *initiating* DMA TLPs on PCIe, not responding to incoming DMA, I guess. > > - These TLPs would be generated elsewhere in the PCIe fabric, e.g., > > by a driver on the host doing MMIO to the endpoint, or possibly > > another endpoint doing peer-to-peer DMA. > > MMIO or DMA, yes. > > > - Mem read/write TLPs are routed by address, and the endpoint > > accepts them if the address matches one of its BARs. > > Yes, but pci_epc_mem_map() is not for use for BARs on the EP side. > BARs setup on EP PCI controllers use inbound ATU windows. > pci_epc_mem_map() programs outbound ATU windows. BARs setup use > pci_epf_alloc_space() + pci_epc_set_bar() to setup a BAR. > > > - This is a little different from a Root Complex, which would > > basically treat reads/writes to anything outside the Root Port > > windows as incoming DMA headed to physical memory. > > > > - A Root Complex would use the TLP address (the PCI bus address) > > directly as a CPU physical memory address unless the TLP address > > is translated by an IOMMU. > > > > - For the endpoint, you want the BAR to be an aperture to physical > > memory in the address space of the SoC driving the endpoint. > > Yes, and as said above this is done with pci_epf_alloc_space() + > pci_epc_set_bar(), not pci_epc_mem_map(). And if the endpoint is *initiating* DMA, BARs of that endpoint are not involved at all. The target of the DMA would be either host memory or a BAR of another endpoint. > > - The SoC physical memory address may be different from the PCI but > > address in the TLP, and pci_epc_mem_map() is the way to account > > for this? > > pci_epc_mem_map() is essentially a call to pci_epc_mem_alloc_addr() > + pci_epc_map_addr(). pci_epc_mem_alloc_addr() allocates memory from > the EP PCI controller outbound windows and pci_epc_map_addr() > programs an EP PCI controller outbound ATU entry to map that memory > to some PCI address region of the host (RC). I see "RC" and "RC address" used often in this area, but I don't quite understand the RC connection. It sounds like this might be the PCI memory address space, i.e., the set of addresses that might appear in PCIe TLPs? Does "EPC addr space" refer to the physical address space of the processor operating the endpoint, or does it mean something more specific? > pci_epc_mem_map() was created because the current epc API does not > hide any of the EP PCI controller PCI address alignment constraints > for programming outbound ATU entries. This means that using directly > pci_epc_mem_alloc_addr() + pci_epc_map_addr(), things may or may not > work (often the latter) depending on the PCI address region (start > address and its size) that is to be mapped and accessed by the > endpoint function driver. > > Most EP PCI controllers at least require a PCI address to be aligned > to some fixed boundary (e.g. 64K for the RK3588 SoC on the Rock5B > board). Even such alignment boundary/mask is not exposed through the > API (epc_features->align is for BARs and should not be used for > outbound ATU programming). Worse yet, some EP PCI controllers use > the lower bits of a local memory window address range as the lower > bits of the RC PCI address range when generating TLPs (e.g. RK3399 > and Cadence EP controller). 
For these EP PCI controllers, the > programming of outbound ATU entries for mapping a RC PCI address > region thus endup depending on the start PCI address of the region > *and* the region size (as the size drives the number of lower bits > of address that will change over the entire region). > > The above API issues where essentially unsolvable from a regular > endpoint driver correctly coded to be EP controller independent. I > could not get my prototype nvme endpoint function driver to work > without introducing pci_epc_mem_map() (nvme does not require any > special alignment of command buffers, so on an nvme EPF driver we > end up having to deal with essentially random PCI addresses that > have no special alignment). > > pci_epc_mem_map() relies on the new ->align_addr() EP controller > operation to get information about the start address and the size to > use for mapping to local memory a PCI address region. The size given > is used for pci_epc_mem_alloc_addr() and the start address and size > given are used with pci_epc_map_addr(). This generates a correct > mapping regardless of the PCI address and size given. And for the > actual data access by the EP function driver, pci_epc_mem_map() > gives the address within the mapping that correspond to the PCI > address that we wanted mapped. > > - IOMMU translations from PCI to CPU physical address space are > > pretty arbitrary and needn't be contiguous on the CPU side. > > > > - pci_epc_mem_map() sets up a conceptually similar PCI to CPU > > address space translation, but it's much simpler because it > > basically applies just a constant offset?
On 10/23/24 08:49, Bjorn Helgaas wrote: > On Wed, Oct 23, 2024 at 07:05:54AM +0900, Damien Le Moal wrote: >> On 10/23/24 05:47, Bjorn Helgaas wrote: >>> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote: >>>> On 10/22/24 07:19, Bjorn Helgaas wrote: >>>>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote: >>>>>> This series introduces the new functions pci_epc_mem_map() and >>>>>> pci_epc_mem_unmap() to improve handling of the PCI address mapping >>>>>> alignment constraints of endpoint controllers in a controller >>>>>> independent manner. >>>>> >>>>> Hi Damien, I know this is obvious to everybody except me, but who >>>>> uses this mapping? A driver running on the endpoint that does >>>>> MMIO? DMA (Memory Reads or Writes that target an endpoint BAR)? >>>>> I'm still trying to wrap my head around the whole endpoint driver >>>>> model. >>>> >>>> The mapping API is for mmio or DMA using endpoint controller memory >>>> mapped to a host PCI address range. It is not for BARs. BARs setup >>>> does not use the same API and has not changed with these patches. >>>> >>>> BARs can still be accessed on the endpoint (within the EPF driver) >>>> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any >>>> data transfer between the PCI RC host and the EPF driver on the EP >>>> host that use mmio or DMA generic channel (memory copy offload >>>> channel) needs the new mapping API. DMA transfers that can be done >>>> using dedicated DMA rx/tx channels associated with the endpoint >>>> controller do not need to use this API as the mapping to the host >>>> PCI address space is automatically handled by the DMA driver. >>> >>> Sorry I'm dense. I'm really not used to thinking in the endpoint >>> point of view. Correct me where I go off the rails: >>> >>> - This code (pci_epc_mem_map()) runs on an endpoint, basically as >>> part of some endpoint firmware, right? >> >> Not sure what you mean by "firmware" here. pci_epc_mem_map() is >> intended for use by a PCI endpoint function driver on the EP side, >> nothing else. > > I mean the software operating the endpoint from the "inside", not from > the PCIe side. Got it. So yes, a real device firmware, or a PCI endpoint function driver for a machine running the PCI endpoint framework. > In a non-endpoint framework scenario, the kernel running on the host > enumerates PCI devices using config reads, and a driver like nvme runs > on the host and operates an NVMe device with config accesses and MMIO > to the NVMe BARs. Yep, I know that. >> pci_epc_mem_map() does not receive anything itself. It only >> allocates local EP controller memory from outbound ATU windows and >> map that to a PCI address region of the host (RC side). Here "map" >> means programming the ATU using a correctly aligned PCI address that >> satisfies the EP PCI controller alignment constraints. > > I assumed we were talking about an endpoint responding to PCIe traffic > generated elsewhere, but I think my assumption was wrong. Not at all. See below. > I don't have any specs for these endpoint controllers, and there's > nothing in Documentation/PCI/endpoint/ about ATUs. Some doc and a few > pictures would go a long ways. Agree. The PCI endpoint API documentation is far from great and should be improved. Will try to do something about it. >> The memory read/write TLPs will happen once the EP function driver >> access the mapped memory with memcpy_fromio/toio() (or use generic >> memory copy offload DMA channel). 
> > This would be the endpoint itself *initiating* DMA TLPs on PCIe, not > responding to incoming DMA, I guess. It can be both. What differs is how the endpoint gets the PCI address to read/write. For an endpoint initiated transfer, typically, the PCI address is obtained from some "command" received through a BAR or through DMA. The command has the PCI addresses to use for transferring data to/from the host (e.g. an nvme rw command uses PRPs or SGLs to specify the PCI address segments for the data buffer of a command). For this case, the EPF driver calls pci_epc_mem_map() for the command buffer, does the transfer (memcpy_toio/fromio()) and unmaps with pci_epc_mem_unmap(). Note though that here, if an eDMA channel is used for the transfer, the DMA engine will do the mapping automatically and the epf does not need to call pci_epc_mem_map()/pci_epc_mem_unmap(). There is still an issue in this area which is that it is *not* clear if the DMA channel used can actually do the mapping automatically or not. E.g. the generic DMA channel (mem copy offload engine) will not. So there is still some API improvement needed to abstract more HW dependent things here. For an RC initiated transfer, the transfers are not done to random addresses. E.g. using NVMe CMB (controller memory buffer), the host will tell the endpoint "I want to use the CMB with this PCI address". Then the endpoint can use pci_epc_mem_map() to map the CMB to said PCI address and exchange data with the host through it. >>> - These TLPs would be generated elsewhere in the PCIe fabric, e.g., >>> by a driver on the host doing MMIO to the endpoint, or possibly >>> another endpoint doing peer-to-peer DMA. >> >> MMIO or DMA, yes. >> >>> - Mem read/write TLPs are routed by address, and the endpoint >>> accepts them if the address matches one of its BARs. >> >> Yes, but pci_epc_mem_map() is not for use for BARs on the EP side. >> BARs setup on EP PCI controllers use inbound ATU windows. >> pci_epc_mem_map() programs outbound ATU windows. BARs setup use >> pci_epf_alloc_space() + pci_epc_set_bar() to setup a BAR. >> >>> - This is a little different from a Root Complex, which would >>> basically treat reads/writes to anything outside the Root Port >>> windows as incoming DMA headed to physical memory. >>> >>> - A Root Complex would use the TLP address (the PCI bus address) >>> directly as a CPU physical memory address unless the TLP address >>> is translated by an IOMMU. >>> >>> - For the endpoint, you want the BAR to be an aperture to physical >>> memory in the address space of the SoC driving the endpoint. >> >> Yes, and as said above this is done with pci_epf_alloc_space() + >> pci_epc_set_bar(), not pci_epc_mem_map(). > > And if the endpoint is *initiating* DMA, BARs of that endpoint are not > involved at all. The target of the DMA would be either host memory or > a BAR of another endpoint. Yes. >>> - The SoC physical memory address may be different from the PCI but >>> address in the TLP, and pci_epc_mem_map() is the way to account >>> for this? >> >> pci_epc_mem_map() is essentially a call to pci_epc_mem_alloc_addr() >> + pci_epc_map_addr(). pci_epc_mem_alloc_addr() allocates memory from >> the EP PCI controller outbound windows and pci_epc_map_addr() >> programs an EP PCI controller outbound ATU entry to map that memory >> to some PCI address region of the host (RC). > > I see "RC" and "RC address" used often in this area, but I don't quite > understand the RC connection.
It sounds like this might be the PCI > memory address space, i.e., the set of addresses that might appear in > PCIe TLPs? Yes. Sorry about that. PCI address == RC address in my head since the endpoint does not "own" the PCI address space, the host RC does. > Does "EPC addr space" refer to the physical address space of the > processor operating the endpoint, or does it mean something more > specific? Correct. EPC address is the local memory address on the endpoint, so CPU addresses. For something that maps to an RC PCI address, that generally means addresses within the EP PCI controller outbound memory regions.
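To illustrate the endpoint-initiated flow described above as code, a hedged sketch: "struct my_cmd" and its fields are hypothetical, and the loop assumes (per my reading of the series) that the mapped size returned in map.pci_size may be smaller than requested.

/* Hypothetical command layout; only for illustration. */
struct my_cmd {
        u64     host_buf_addr;  /* PCI address supplied by the host, any alignment */
        u32     len;
};

static int epf_fetch_host_buffer(struct pci_epf *epf, struct my_cmd *cmd, void *to)
{
        struct pci_epc *epc = epf->epc;
        u64 pci_addr = cmd->host_buf_addr;
        size_t left = cmd->len;
        struct pci_epc_map map;
        int ret;

        while (left) {
                /* Map (part of) the host buffer through an outbound window. */
                ret = pci_epc_mem_map(epc, epf->func_no, epf->vfunc_no,
                                      pci_addr, left, &map);
                if (ret)
                        return ret;

                memcpy_fromio(to, map.virt_addr, map.pci_size);

                pci_epc_mem_unmap(epc, epf->func_no, epf->vfunc_no, &map);

                to += map.pci_size;
                pci_addr += map.pci_size;
                left -= map.pci_size;
        }

        return 0;
}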
On Wed, Oct 23, 2024 at 11:51:41AM +0900, Damien Le Moal wrote: (snip) > For an endpoint initiated transfer, typically, the PCI address is obtained from > some "command" received through a BAR or through DMA. The command has the PCI > addresses to use for transfering data to/from the host (e.g. an nvme rw command > uses PRPs or SGLs to specify the PCI address segments for the data buffer of a > command). For this case, the EPF driver calls calls pci_epc_mem_map() for the > command buffer, does the transfer (memcpy_toio/fromio()) and unmaps with > pci_epc_mem_unmap(). Note though that here, if an eDMA channel is used for the > transfer, the DMA engine will do the mapping automatically and the epf does not > need to call pci_epc_mem_map()/pci_epc_mem_unmap(). There is still an issue in > this area which is that it is *not* clear if the DMA channel used can actually > do the mapping automatically or not. E.g. the generic DMA channel (mem copy > offload engine) will not. So there is still some API improvement needed to > abstract more HW dependent things here. FWIW, in my final reply here: https://lore.kernel.org/lkml/ZiYuIaX7ZV0exKMt@ryzen/ " I did suggest that DWC-based drivers could set a DMA_SLAVE_SKIP_MEM_MAP flag or similar when registering the eDMA, which pci-epf-test then could check, but I got no response if anyone else thought that this was a good idea. " For DMA_SLAVE (private tx/rx DMA channels): For DWC-based controllers, we can definitely set DMA_SLAVE_SKIP_MEM_MAP when registering the eDMA (e.g. in dw_pcie_edma_detect()). However, I don't know how the DMA hardware (if any) in: drivers/pci/controller/cadence/pcie-cadence-ep.c drivers/pci/controller/pcie-rcar-ep.c drivers/pci/controller/pcie-rockchip-ep.c works, so I'm not sure if those drivers can set DMA_SLAVE_SKIP_MEM_MAP (the safest thing is to not set that flag, until we know how they work). For DMA_MEMCPY: We know that we need to perform pci_epc_mem_map() when using DMA API + "dummy" memcpy dma-channel (DMA_MEMCPY). I could imagine that some embedded DMA controllers also provide DMA_MEMCPY capabilities (in addition to DMA_SLAVE). So I guess the safest thing is to call the flag something like: PCI_EPC_DMA_SKIP_MEM_MAP (rather than PCI_EPC_DMA_SLAVE_SKIP_MEM_MAP). (since the embedded DMA controller might provide both DMA_SLAVE and DMA_MEMCPY). And let the EPC driver (e.g. dw_pcie_edma_detect()), or possibly the DMA driver itself to provide/set this flag. Kind regards, Niklas
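A rough sketch of how such a flag could look from an EPF driver's point of view. Everything here is hypothetical: the PCI_EPC_DMA_SKIP_MEM_MAP flag, the "capabilities" field, and the do_slave_dma_transfer() helper standing in for the dmaengine_slave_config() + prep_slave_single() setup; the sketch also assumes the whole range could be mapped in one go, where a real driver would loop as pci-epf-test does.

/* Hypothetical capability flag, as proposed above. */
#define PCI_EPC_DMA_SKIP_MEM_MAP        BIT(0)

static int epf_private_dma_to_host(struct pci_epf *epf, struct dma_chan *chan,
                                   dma_addr_t local, u64 host_pci_addr, size_t len)
{
        struct pci_epc *epc = epf->epc;
        struct pci_epc_map map = {};
        /* Hypothetical: EPC drivers whose DMA engine bypasses the iATU set the flag. */
        bool need_map = !(epc->capabilities & PCI_EPC_DMA_SKIP_MEM_MAP);
        int ret;

        if (need_map) {
                /* e.g. a pl330-style engine: must go through an outbound window. */
                ret = pci_epc_mem_map(epc, epf->func_no, epf->vfunc_no,
                                      host_pci_addr, len, &map);
                if (ret)
                        return ret;
        }

        /* eDMA-style engines can be given the raw host PCI address directly. */
        ret = do_slave_dma_transfer(chan, local,
                                    need_map ? map.phys_addr : host_pci_addr, len);

        if (need_map)
                pci_epc_mem_unmap(epc, epf->func_no, epf->vfunc_no, &map);

        return ret;
}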