
[v6,0/6] Improve PCI memory mapping API

Message ID 20241012113246.95634-1-dlemoal@kernel.org (mailing list archive)

Message

Damien Le Moal Oct. 12, 2024, 11:32 a.m. UTC
This series introduces the new functions pci_epc_mem_map() and
pci_epc_mem_unmap() to improve handling of the PCI address mapping
alignment constraints of endpoint controllers in a controller
independent manner.

The issue addressed is that the alignment defined by the "align" field of
struct pci_epc_features applies to inbound ATU entries (e.g. BARs) and is
a fixed value, whereas some controllers need a PCI address
alignment that depends on the PCI address itself. For instance, the
rk3399 SoC controller in endpoint mode uses the lower bits of the local
endpoint memory address as the lower bits for the PCI addresses for data
transfers. That is, when mapping local memory, one must take into
account the number of bits of the RC PCI address that change from the
start address of the mapping.

To fix this, the new endpoint controller method .align_addr is
introduced and called from the new pci_epc_mem_map() function. This
method is optional and for controllers that do not define it, it is
assumed that the controller has no PCI address constraint.
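
As an illustration only (this is a sketch of the intent, not the code of
patch 3, and the operation prototype is assumed from the description above),
the fallback behavior amounts to something like:

#include <linux/pci-epc.h>

/*
 * Sketch only: consult the controller when the optional .align_addr()
 * operation is implemented, otherwise use the PCI address as-is
 * (no alignment constraint) and keep the requested mapping size.
 */
static u64 example_get_map_pci_addr(struct pci_epc *epc, u64 pci_addr,
                                    size_t *map_size, size_t *map_offset)
{
        if (epc->ops->align_addr)
                return epc->ops->align_addr(epc, pci_addr,
                                            map_size, map_offset);

        *map_offset = 0;
        return pci_addr;
}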

The function pci_epc_mem_map() is a helper which obtains the mapping
information, allocates endpoint controller memory according to the
mapping size obtained, and maps the memory. pci_epc_mem_unmap()
unmaps and frees the endpoint memory.
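
For example, a minimal usage sketch from an endpoint function driver
(assuming the struct pci_epc_map fields introduced by patch 3; error
handling trimmed):

#include <linux/io.h>
#include <linux/pci-epc.h>
#include <linux/pci-epf.h>

/*
 * Usage sketch, not taken from the patches: read @len bytes from a host
 * PCI address. pci_epc_mem_map() hides the controller memory allocation
 * and any PCI address alignment constraint, and map.virt_addr points
 * exactly at @host_pci_addr within the mapping.
 */
static int example_read_from_host(struct pci_epf *epf, u64 host_pci_addr,
                                  void *buf, size_t len)
{
        struct pci_epc_map map;
        int ret;

        while (len) {
                ret = pci_epc_mem_map(epf->epc, epf->func_no, epf->vfunc_no,
                                      host_pci_addr, len, &map);
                if (ret)
                        return ret;

                /* map.pci_size may be smaller than requested, hence the loop. */
                memcpy_fromio(buf, map.virt_addr, map.pci_size);

                pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, &map);

                buf += map.pci_size;
                host_pci_addr += map.pci_size;
                len -= map.pci_size;
        }

        return 0;
}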

This series is organized as follows:
 - Patch 1 introduces a small helper to clean up the epc core code
 - Patch 2 improves pci_epc_mem_alloc_addr()
 - Patch 3 introduces the new align_addr() endpoint controller method
   and the epc functions pci_epc_mem_map() and pci_epc_mem_unmap().
 - Patch 4 documents these new functions.
 - Patch 5 modifies the test endpoint function driver to use 
   pci_epc_mem_map() and pci_epc_mem_unmap() to illustrate the use of
   these functions.
 - Finally, patch 6 implements the RK3588 endpoint controller driver
   .align_addr() operation to satisfy that controller's PCI address
   alignment constraint.

This patch series was tested using the PCI endpoint test driver (as-is
and a modified version removing memory allocation alignment on the host
side) as well as with a prototype NVMe endpoint function driver (where
data transfers use addresses that are never aligned to any specific
boundary).

Changes from v5:
 - Changed patch 3 to rename the new controller operation to align_addr
   and change its interface. Patch 6 is changed accordingly.

Changes from v4:
 - Removed the patch adding the pci_epc_map_align() function. The former
   .map_align controller operation is now introduced in patch 3 as
   "get_mem_map()" and used directly in the new pci_epf_mem_map()
   function.
 - Modified the documentation patch 4 to reflect the previous change.
 - Changed patch 6 title and modified it to rename map_align to
   get_mem_map in accordance with the new patch 3.
 - Added review tags

Changes from v3:
 - Addressed Niklas comments (improved patch 2 commit message, added
   comments in the pci_epc_map_align() function in patch 3, typos and
   improvements in patch 5, patch 7 commit message).
 - Added review tags

Changes from v2:
 - Dropped all patches for the rockchip-ep. These patches will be sent
   later as a separate series.
 - Added patch 2 and 5
 - Added review tags to patch 1

Changes from v1:
 - Changed pci_epc_check_func() to pci_epc_function_is_valid() in patch
   1.
 - Removed patch "PCI: endpoint: Improve pci_epc_mem_alloc_addr()"
   (former patch 2 of v1)
 - Various typos cleanups all over. Also fixed some blank space
   indentation.
 - Added review tags

Damien Le Moal (6):
  PCI: endpoint: Introduce pci_epc_function_is_valid()
  PCI: endpoint: Improve pci_epc_mem_alloc_addr()
  PCI: endpoint: Introduce pci_epc_mem_map()/unmap()
  PCI: endpoint: Update documentation
  PCI: endpoint: test: Use pci_epc_mem_map/unmap()
  PCI: dwc: endpoint: Implement the pci_epc_ops::align_addr() operation

 Documentation/PCI/endpoint/pci-endpoint.rst   |  29 ++
 .../pci/controller/dwc/pcie-designware-ep.c   |  15 +
 drivers/pci/endpoint/functions/pci-epf-test.c | 372 +++++++++---------
 drivers/pci/endpoint/pci-epc-core.c           | 182 ++++++---
 drivers/pci/endpoint/pci-epc-mem.c            |   9 +-
 include/linux/pci-epc.h                       |  38 ++
 6 files changed, 415 insertions(+), 230 deletions(-)

Comments

Manivannan Sadhasivam Oct. 12, 2024, 11:57 a.m. UTC | #1
On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> This series introduces the new functions pci_epc_mem_map() and
> pci_epc_mem_unmap() to improve handling of the PCI address mapping
> alignment constraints of endpoint controllers in a controller
> independent manner.
> 
> The issue fixed is that the fixed alignment defined by the "align" field
> of struct pci_epc_features is defined for inbound ATU entries (e.g.
> BARs) and is a fixed value, whereas some controllers need a PCI address
> alignment that depends on the PCI address itself. For instance, the
> rk3399 SoC controller in endpoint mode uses the lower bits of the local
> endpoint memory address as the lower bits for the PCI addresses for data
> transfers. That is, when mapping local memory, one must take into
> account the number of bits of the RC PCI address that change from the
> start address of the mapping.
> 
> To fix this, the new endpoint controller method .align_addr is
> introduced and called from the new pci_epc_mem_map() function. This
> method is optional and for controllers that do not define it, it is
> assumed that the controller has no PCI address constraint.
> 
> The functions pci_epc_mem_map() is a helper function which obtains
> the mapping information, allocates endpoint controller memory according
> to the mapping size obtained and maps the memory. pci_epc_mem_unmap()
> unmaps and frees the endpoint memory.
> 
> This series is organized as follows:
>  - Patch 1 introduces a small helper to clean up the epc core code
>  - Patch 2 improves pci_epc_mem_alloc_addr()
>  - Patch 3 introduce the new align_addr() endpoint controller method
>    and the epc functions pci_epc_mem_map() and pci_epc_mem_unmap().
>  - Patch 4 documents these new functions.
>  - Patch 5 modifies the test endpoint function driver to use 
>    pci_epc_mem_map() and pci_epc_mem_unmap() to illustrate the use of
>    these functions.
>  - Finally, patch 6 implements the RK3588 endpoint controller driver
>    .align_addr() operation to satisfy that controller PCI address
>    alignment constraint.
> 
> This patch series was tested using the pci endpoint test driver (as-is
> and a modified version removing memory allocation alignment on the host
> side) as well as with a prototype NVMe endpoint function driver (where
> data transfers use addresses that are never aligned to any specific
> boundary).
> 

Applied to pci/endpoint!

- Mani

> Changes from v5:
>  - Changed patch 3 to rename the new controller operation to align_addr
>    and change its interface. Patch 6 is changed accordingly.
> 
> Changes from v4:
>  - Removed the patch adding the pci_epc_map_align() function. The former
>    .map_align controller operation is now introduced in patch 3 as
>    "get_mem_map()" and used directly in the new pci_epf_mem_map()
>    function.
>  - Modified the documentation patch 4 to reflect the previous change.
>  - Changed patch 6 title and modified it to rename map_align to
>    get_mem_map in accordance with the new patch 3.
>  - Added review tags
> 
> Changes from v3:
>  - Addressed Niklas comments (improved patch 2 commit message, added
>    comments in the pci_epc_map_align() function in patch 3, typos and
>    improvements in patch 5, patch 7 commit message).
>  - Added review tags
> 
> Changes from v2:
>  - Dropped all patches for the rockchip-ep. These patches will be sent
>    later as a separate series.
>  - Added patch 2 and 5
>  - Added review tags to patch 1
> 
> Changes from v1:
>  - Changed pci_epc_check_func() to pci_epc_function_is_valid() in patch
>    1.
>  - Removed patch "PCI: endpoint: Improve pci_epc_mem_alloc_addr()"
>    (former patch 2 of v1)
>  - Various typos cleanups all over. Also fixed some blank space
>    indentation.
>  - Added review tags
> 
> Damien Le Moal (6):
>   PCI: endpoint: Introduce pci_epc_function_is_valid()
>   PCI: endpoint: Improve pci_epc_mem_alloc_addr()
>   PCI: endpoint: Introduce pci_epc_mem_map()/unmap()
>   PCI: endpoint: Update documentation
>   PCI: endpoint: test: Use pci_epc_mem_map/unmap()
>   PCI: dwc: endpoint: Implement the pci_epc_ops::align_addr() operation
> 
>  Documentation/PCI/endpoint/pci-endpoint.rst   |  29 ++
>  .../pci/controller/dwc/pcie-designware-ep.c   |  15 +
>  drivers/pci/endpoint/functions/pci-epf-test.c | 372 +++++++++---------
>  drivers/pci/endpoint/pci-epc-core.c           | 182 ++++++---
>  drivers/pci/endpoint/pci-epc-mem.c            |   9 +-
>  include/linux/pci-epc.h                       |  38 ++
>  6 files changed, 415 insertions(+), 230 deletions(-)
> 
> -- 
> 2.47.0
>
Damien Le Moal Oct. 12, 2024, 12:03 p.m. UTC | #2
On 10/12/24 20:57, Manivannan Sadhasivam wrote:
> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>> This series introduces the new functions pci_epc_mem_map() and
>> pci_epc_mem_unmap() to improve handling of the PCI address mapping
>> alignment constraints of endpoint controllers in a controller
>> independent manner.
>>
>> The issue fixed is that the fixed alignment defined by the "align" field
>> of struct pci_epc_features is defined for inbound ATU entries (e.g.
>> BARs) and is a fixed value, whereas some controllers need a PCI address
>> alignment that depends on the PCI address itself. For instance, the
>> rk3399 SoC controller in endpoint mode uses the lower bits of the local
>> endpoint memory address as the lower bits for the PCI addresses for data
>> transfers. That is, when mapping local memory, one must take into
>> account the number of bits of the RC PCI address that change from the
>> start address of the mapping.
>>
>> To fix this, the new endpoint controller method .align_addr is
>> introduced and called from the new pci_epc_mem_map() function. This
>> method is optional and for controllers that do not define it, it is
>> assumed that the controller has no PCI address constraint.
>>
>> The functions pci_epc_mem_map() is a helper function which obtains
>> the mapping information, allocates endpoint controller memory according
>> to the mapping size obtained and maps the memory. pci_epc_mem_unmap()
>> unmaps and frees the endpoint memory.
>>
>> This series is organized as follows:
>>  - Patch 1 introduces a small helper to clean up the epc core code
>>  - Patch 2 improves pci_epc_mem_alloc_addr()
>>  - Patch 3 introduce the new align_addr() endpoint controller method
>>    and the epc functions pci_epc_mem_map() and pci_epc_mem_unmap().
>>  - Patch 4 documents these new functions.
>>  - Patch 5 modifies the test endpoint function driver to use 
>>    pci_epc_mem_map() and pci_epc_mem_unmap() to illustrate the use of
>>    these functions.
>>  - Finally, patch 6 implements the RK3588 endpoint controller driver
>>    .align_addr() operation to satisfy that controller PCI address
>>    alignment constraint.
>>
>> This patch series was tested using the pci endpoint test driver (as-is
>> and a modified version removing memory allocation alignment on the host
>> side) as well as with a prototype NVMe endpoint function driver (where
>> data transfers use addresses that are never aligned to any specific
>> boundary).
>>
> 
> Applied to pci/endpoint!

Awesome ! Thanks a lot !
Bjorn Helgaas Oct. 21, 2024, 10:19 p.m. UTC | #3
On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> This series introduces the new functions pci_epc_mem_map() and
> pci_epc_mem_unmap() to improve handling of the PCI address mapping
> alignment constraints of endpoint controllers in a controller
> independent manner.

Hi Damien, I know this is obvious to everybody except me, but who uses
this mapping?  A driver running on the endpoint that does MMIO?  DMA
(Memory Reads or Writes that target an endpoint BAR)?  I'm still
trying to wrap my head around the whole endpoint driver model.

Bjorn
Damien Le Moal Oct. 22, 2024, 1:51 a.m. UTC | #4
On 10/22/24 07:19, Bjorn Helgaas wrote:
> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>> This series introduces the new functions pci_epc_mem_map() and
>> pci_epc_mem_unmap() to improve handling of the PCI address mapping
>> alignment constraints of endpoint controllers in a controller
>> independent manner.
> 
> Hi Damien, I know this is obvious to everybody except me, but who uses
> this mapping?  A driver running on the endpoint that does MMIO?  DMA
> (Memory Reads or Writes that target an endpoint BAR)?  I'm still
> trying to wrap my head around the whole endpoint driver model.

The mapping API is for MMIO or DMA using endpoint controller memory mapped to a
host PCI address range. It is not for BARs. BAR setup does not use the same API
and has not changed with these patches.

BARs can still be accessed on the endpoint (within the EPF driver) with regular
READ_ONCE()/WRITE_ONCE() once they are set. But any data transfer between the
PCI RC host and the EPF driver on the EP host that uses MMIO or the generic DMA
channel (memory copy offload channel) needs the new mapping API. DMA transfers
that can be done using dedicated DMA rx/tx channels associated with the endpoint
controller do not need to use this API, as the mapping to the host PCI address
space is automatically handled by the DMA driver.
Niklas Cassel Oct. 22, 2024, 8:38 a.m. UTC | #5
On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> On 10/22/24 07:19, Bjorn Helgaas wrote:
> > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>
> DMA transfers that can be done using dedicated DMA rx/tx channels associated
> with the endpoint controller do not need to use this API as the mapping to
> the host PCI address space is automatically handled by the DMA driver.

I disagree with this part. I think that it depends on the DMA controller.

Looking at the manual for e.g. the embedded DMA controller on the DWC
controller (eDMA):
""
When you do not want the iATU to translate outbound requests that are generated by the
internal DMA module, then you must implement one of the following approaches:
- Ensure that the combination of DMA channel programming and iATU control register
programming, causes no translation of DMA traffic to be done in the iATU.
or
- Activate the DMA bypass mode to allow request TLPs which are initiated by the DMA
controller to pass through the iATU untranslated. You can activate DMA bypass mode by
setting the DMA Bypass field of the iATU Control 2 Register (IATU_REGION_C-
TRL_OFF_2_OUTBOUND_0).
""

However, we also know that, if there is no match in the iATU table:
""
The default behavior of the ATU when there is no address match in the outbound direction or no
TLP attribute match in the inbound direction, is to pass the transaction through.
""

I.e. if there is an iATU mapping (which is what pci_epc_map_addr() sets up),
the eDMA will take that into account. If there is none, it will go through
untranslated.


So for the eDMA embedded on the DWC controller, we do not strictly need to
call pci_epc_map_addr(). (However, we probably want to do it anyway, as long
as we haven't enabled DMA bypass mode, just to make sure that there is no
other iATU mapping in the mapping table that will affect our transfer.)

However, if you think about a generic DMA controller, e.g. an ARM PrimeCell
pl330, I don't see how that DMA controller would be able to perform
transfers correctly if there is no iATU mapping for the region that it
is reading/writing to.

So the safest thing, that will work with any DMA controller is probably to
always call pci_epc_map_addr() before doing the transfer.


However, as I pointed out in:
https://lore.kernel.org/lkml/Zg5oeDzq5u3jmKIu@ryzen/

This behavior is still inconsistent between PCI EPF drivers:
E.g. for pci-epf-test.c:
When using dedicated DMA rx/tx channels (epf_test->dma_private == true),
and when FLAG_USE_DMA is set, it currently always calls pci_epc_map_addr()
before doing the transfer.

However for pci-epf-mhi.c:
When using dedicated DMA rx/tx channels and when MHI_EPF_USE_DMA is set,
it currently does not call pci_epc_map_addr() before doing the transfer.


Kind regards,
Niklas
Damien Le Moal Oct. 22, 2024, 11:57 a.m. UTC | #6
On 10/22/24 17:38, Niklas Cassel wrote:
> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
>> On 10/22/24 07:19, Bjorn Helgaas wrote:
>>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>>
>> DMA transfers that can be done using dedicated DMA rx/tx channels associated
>> with the endpoint controller do not need to use this API as the mapping to
>> the host PCI address space is automatically handled by the DMA driver.
> 
> I disagree with this part. It think that it depends on the DMA controller.
> 
> Looking at the manual for e.g. the embedded DMA controller on the DWC
> controller (eDMA):
> ""
> When you do not want the iATU to translate outbound requests that are generated by the
> internal DMA module, then you must implement one of the following approaches:
> - Ensure that the combination of DMA channel programming and iATU control register
> programming, causes no translation of DMA traffic to be done in the iATU.
> or
> - Activate the DMA bypass mode to allow request TLPs which are initiated by the DMA
> controller to pass through the iATU untranslated. You can activate DMA bypass mode by
> setting the DMA Bypass field of the iATU Control 2 Register (IATU_REGION_C-
> TRL_OFF_2_OUTBOUND_0).
> ""
> 
> However, we also know that, if there is no match in the iATU table:
> ""
> The default behavior of the ATU when there is no address match in the outbound direction or no
> TLP attribute match in the inbound direction, is to pass the transaction through.
> ""
> 
> I.e. if there is a iATU mapping (which is what pci_epc_map_addr() sets up),
> the eDMA will take that into account. If there is none, it will go through
> untranslated.
> 
> 
> So for the eDMA embedded on the DWC controller, we do not strictly need to
> call pci_epc_map_addr(). (However, we probably want to do it anyway, as long
> as we haven't enabled DMA bypass mode, just to make sure that there is no
> other iATU mapping in the mapping table that will affect our transfer.)
> 
> However, if you think about a generic DMA controller, e.g. an ARM primecell
> pl330, I don't see how it that DMA controller will be able to perform
> transfers correctly if there is not an iATU mapping for the region that it
> is reading/writing to.
> 
> So the safest thing, that will work with any DMA controller is probably to
> always call pci_epc_map_addr() before doing the transfer.

Indeed, I think you are right. Not doing the mapping before DMA with the RK3588
works fine, but that may not be the case for all controllers.
Manivannan Sadhasivam Oct. 22, 2024, 1:56 p.m. UTC | #7
On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote:
> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> > On 10/22/24 07:19, Bjorn Helgaas wrote:
> > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> >
> > DMA transfers that can be done using dedicated DMA rx/tx channels associated
> > with the endpoint controller do not need to use this API as the mapping to
> > the host PCI address space is automatically handled by the DMA driver.
> 
> I disagree with this part. It think that it depends on the DMA controller.
> 
> Looking at the manual for e.g. the embedded DMA controller on the DWC
> controller (eDMA):
> ""
> When you do not want the iATU to translate outbound requests that are generated by the
> internal DMA module, then you must implement one of the following approaches:
> - Ensure that the combination of DMA channel programming and iATU control register
> programming, causes no translation of DMA traffic to be done in the iATU.
> or
> - Activate the DMA bypass mode to allow request TLPs which are initiated by the DMA
> controller to pass through the iATU untranslated. You can activate DMA bypass mode by
> setting the DMA Bypass field of the iATU Control 2 Register (IATU_REGION_C-
> TRL_OFF_2_OUTBOUND_0).
> ""
> 
> However, we also know that, if there is no match in the iATU table:
> ""
> The default behavior of the ATU when there is no address match in the outbound direction or no
> TLP attribute match in the inbound direction, is to pass the transaction through.
> ""
> 
> I.e. if there is a iATU mapping (which is what pci_epc_map_addr() sets up),
> the eDMA will take that into account. If there is none, it will go through
> untranslated.
> 
> 
> So for the eDMA embedded on the DWC controller, we do not strictly need to
> call pci_epc_map_addr(). (However, we probably want to do it anyway, as long
> as we haven't enabled DMA bypass mode, just to make sure that there is no
> other iATU mapping in the mapping table that will affect our transfer.)
> 

I do not recommend that. The EPF driver *should* know how to isolate the address
spaces between DMA and iATU. Creating the iATU mapping for the DMA address is
just defeating the purpose of using DMA.

If any overlapping mapping is present, then the EPF driver has a bug!

> However, if you think about a generic DMA controller, e.g. an ARM primecell
> pl330, I don't see how it that DMA controller will be able to perform
> transfers correctly if there is not an iATU mapping for the region that it
> is reading/writing to.
> 

I don't think the generic DMA controller can be used to read/write to remote
memory. It needs to be integrated with the PCIe IP so that it can issue PCIe
transactions.

> So the safest thing, that will work with any DMA controller is probably to
> always call pci_epc_map_addr() before doing the transfer.
> 
> 
> However, as I pointed out in:
> https://lore.kernel.org/lkml/Zg5oeDzq5u3jmKIu@ryzen/
> 
> This behavior is still inconsistent between PCI EPF drivers:
> E.g. for pci-epf-test.c:
> When using dedicated DMA rx/tx channels (epf_test->dma_private == true),
> and when FLAG_USE_DMA is set, it currently always calls pci_epc_map_addr()
> before doing the transfer.
> 
> However for pci-epf-mhi.c:
> When using dedicated DMA rx/tx channels and when MHI_EPF_USE_DMA is set,
> it currently does not call pci_epc_map_addr() before doing the transfer.
> 

Because there are not going to be any overlapping mappings. But I agree about
the inconsistency between EPF drivers though...

- Mani
Niklas Cassel Oct. 22, 2024, 2:16 p.m. UTC | #8
On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote:
> On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote:
> > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> > > On 10/22/24 07:19, Bjorn Helgaas wrote:
> > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> 
> > However, if you think about a generic DMA controller, e.g. an ARM primecell
> > pl330, I don't see how it that DMA controller will be able to perform
> > transfers correctly if there is not an iATU mapping for the region that it
> > is reading/writing to.
> > 
> 
> I don't think the generic DMA controller can be used to read/write to remote
> memory. It needs to be integrated with the PCIe IP so that it can issue PCIe
> transactions.

I'm not an expert, so I might of course be misunderstanding how things work.

When the CPU performs an AXI read/write to a MMIO address within the PCIe
controller (specifically the PCIe controller's outbound memory window),
the PCIe controller translates that incoming read/write to a read/write on the
PCIe bus.

(The PCI address of the generated PCIe transaction will depend on how the iATU
has been configured, which determines how reads/writes to the PCIe controller's
outbound memory window should be translated to PCIe addresses.)

If that is how it works when the CPU does the AXI read/write, why wouldn't
things work the same if it is a generic DMA controller performing the AXI
read/write to the MMIO address targeting the PCIe controller's outbound memory
window?


Kind regards,
Niklas
Frank Li Oct. 22, 2024, 3:18 p.m. UTC | #9
On Tue, Oct 22, 2024 at 04:16:24PM +0200, Niklas Cassel wrote:
> On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote:
> > On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote:
> > > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> > > > On 10/22/24 07:19, Bjorn Helgaas wrote:
> > > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> >
> > > However, if you think about a generic DMA controller, e.g. an ARM primecell
> > > pl330, I don't see how it that DMA controller will be able to perform
> > > transfers correctly if there is not an iATU mapping for the region that it
> > > is reading/writing to.
> > >
> >
> > I don't think the generic DMA controller can be used to read/write to remote
> > memory. It needs to be integrated with the PCIe IP so that it can issue PCIe
> > transactions.
>
> I'm not an expert, so I might of course be misunderstanding how things work.
>
> When the CPU performs an AXI read/write to a MMIO address within the PCIe
> controller (specifically the PCIe controller's outbound memory window),
> the PCIe controller translates that incoming read/write to a read/write on the
> PCIe bus.
>
> (The PCI address of the generated PCIe transaction will depend on how the iATU
> has been configured, which determines how reads/writes to the PCIe controller's
> outbound memory window should be translated to PCIe addresses.)
>
> If that is how it works when the CPU does the AXI read/write, why wouldn't
> things work the same if it is an generic DMA controller performing the AXI
> read/write to the MMIO address targeting the PCIe controller's outbound memory
> window?

A generic DMA controller can perform memory-to-memory transfers (to a PCI
mapped window) to do the data transfer, but MMIO needs to be mapped by the
iATU of the PCI controller.

For example, to copy 0x4000_0000 to the PCI host's 0xA_0000_0000:

EP memory (0x4000_0000) -> DMA -> PCI mapped window (iATU) (0x8000_0000) ->
PCI bus (0xA_0000_0000) -> Host memory (0xA_0000_0000)

But the embedded DMA can bypass the iATU and directly send out TLPs:

EP memory (0x4000_0000) -> PCI controller DMA -> PCI bus (0xA_0000_0000)
-> Host memory (0xA_0000_0000)

In other words, the eDMA can copy data anywhere in PCI host memory, whereas a
generic DMA controller can only copy data to the EP PCI mapped window.

Frank


>
>
> Kind regards,
> Niklas
Manivannan Sadhasivam Oct. 22, 2024, 3:30 p.m. UTC | #10
On Tue, Oct 22, 2024 at 04:16:24PM +0200, Niklas Cassel wrote:
> On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote:
> > On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote:
> > > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> > > > On 10/22/24 07:19, Bjorn Helgaas wrote:
> > > > > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> > 
> > > However, if you think about a generic DMA controller, e.g. an ARM primecell
> > > pl330, I don't see how it that DMA controller will be able to perform
> > > transfers correctly if there is not an iATU mapping for the region that it
> > > is reading/writing to.
> > > 
> > 
> > I don't think the generic DMA controller can be used to read/write to remote
> > memory. It needs to be integrated with the PCIe IP so that it can issue PCIe
> > transactions.
> 
> I'm not an expert, so I might of course be misunderstanding how things work.
> 

Neither am I :) I'm just sharing my understanding based on reading the DWC spec
and am open to being corrected if I'm wrong.

> When the CPU performs an AXI read/write to a MMIO address within the PCIe
> controller (specifically the PCIe controller's outbound memory window),
> the PCIe controller translates that incoming read/write to a read/write on the
> PCIe bus.
> 

I don't think the *PCIe controller* translates the read/writes, but the iATU. If
we use the iATU, then the remote address needs to be mapped to the endpoint DDR,
and if the CPU performs an AXI read/write to that address, then the iATU will
translate the DDR address to the remote address and issue PCIe transactions
(together with the PCIe controller).

And if DMA is used, then the DMA controller can issue PCIe transactions to the
remote memory itself (again, together with the PCIe controller). So no mapping
is required here.

- Mani
Bjorn Helgaas Oct. 22, 2024, 8:47 p.m. UTC | #11
On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> On 10/22/24 07:19, Bjorn Helgaas wrote:
> > On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> >> This series introduces the new functions pci_epc_mem_map() and
> >> pci_epc_mem_unmap() to improve handling of the PCI address mapping
> >> alignment constraints of endpoint controllers in a controller
> >> independent manner.
> > 
> > Hi Damien, I know this is obvious to everybody except me, but who
> > uses this mapping?  A driver running on the endpoint that does
> > MMIO?  DMA (Memory Reads or Writes that target an endpoint BAR)?
> > I'm still trying to wrap my head around the whole endpoint driver
> > model.
> 
> The mapping API is for mmio or DMA using endpoint controller memory
> mapped to a host PCI address range. It is not for BARs. BARs setup
> does not use the same API and has not changed with these patches.
> 
> BARs can still be accessed on the endpoint (within the EPF driver)
> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any
> data transfer between the PCI RC host and the EPF driver on the EP
> host that use mmio or DMA generic channel (memory copy offload
> channel) needs the new mapping API. DMA transfers that can be done
> using dedicated DMA rx/tx channels associated with the endpoint
> controller do not need to use this API as the mapping to the host
> PCI address space is automatically handled by the DMA driver.

Sorry I'm dense.  I'm really not used to thinking from the endpoint
point of view.  Correct me where I go off the rails:

  - This code (pci_epc_mem_map()) runs on an endpoint, basically as
    part of some endpoint firmware, right?

  - On the endpoint's PCIe side, it receives memory read/write TLPs?

  - These TLPs would be generated elsewhere in the PCIe fabric, e.g.,
    by a driver on the host doing MMIO to the endpoint, or possibly
    another endpoint doing peer-to-peer DMA.

  - Mem read/write TLPs are routed by address, and the endpoint
    accepts them if the address matches one of its BARs.

  - This is a little different from a Root Complex, which would
    basically treat reads/writes to anything outside the Root Port
    windows as incoming DMA headed to physical memory.

  - A Root Complex would use the TLP address (the PCI bus address)
    directly as a CPU physical memory address unless the TLP address
    is translated by an IOMMU.

  - For the endpoint, you want the BAR to be an aperture to physical
    memory in the address space of the SoC driving the endpoint.

  - The SoC physical memory address may be different from the PCI bus
    address in the TLP, and pci_epc_mem_map() is the way to account
    for this?

  - IOMMU translations from PCI to CPU physical address space are
    pretty arbitrary and needn't be contiguous on the CPU side.

  - pci_epc_mem_map() sets up a conceptually similar PCI to CPU
    address space translation, but it's much simpler because it
    basically applies just a constant offset?
Damien Le Moal Oct. 22, 2024, 10:05 p.m. UTC | #12
On 10/23/24 05:47, Bjorn Helgaas wrote:
> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
>> On 10/22/24 07:19, Bjorn Helgaas wrote:
>>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>>>> This series introduces the new functions pci_epc_mem_map() and
>>>> pci_epc_mem_unmap() to improve handling of the PCI address mapping
>>>> alignment constraints of endpoint controllers in a controller
>>>> independent manner.
>>>
>>> Hi Damien, I know this is obvious to everybody except me, but who
>>> uses this mapping?  A driver running on the endpoint that does
>>> MMIO?  DMA (Memory Reads or Writes that target an endpoint BAR)?
>>> I'm still trying to wrap my head around the whole endpoint driver
>>> model.
>>
>> The mapping API is for mmio or DMA using endpoint controller memory
>> mapped to a host PCI address range. It is not for BARs. BARs setup
>> does not use the same API and has not changed with these patches.
>>
>> BARs can still be accessed on the endpoint (within the EPF driver)
>> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any
>> data transfer between the PCI RC host and the EPF driver on the EP
>> host that use mmio or DMA generic channel (memory copy offload
>> channel) needs the new mapping API. DMA transfers that can be done
>> using dedicated DMA rx/tx channels associated with the endpoint
>> controller do not need to use this API as the mapping to the host
>> PCI address space is automatically handled by the DMA driver.
> 
> Sorry I'm dense.  I'm really not used to thinking in the endpoint
> point of view.  Correct me where I go off the rails:
> 
>   - This code (pci_epc_mem_map()) runs on an endpoint, basically as
>     part of some endpoint firmware, right?

Not sure what you mean by "firmware" here. pci_epc_mem_map() is intended for use
by a PCI endpoint function driver on the EP side, nothing else.

>   - On the endpoint's PCIe side, it receives memory read/write TLPs?

pci_epc_mem_map() does not receive anything itself. It only allocates local EP
controller memory from outbound ATU windows and maps it to a PCI address region
of the host (RC side). Here "map" means programming the ATU using a correctly
aligned PCI address that satisfies the EP PCI controller alignment constraints.

The memory read/write TLPs will happen once the EP function driver accesses the
mapped memory with memcpy_fromio/toio() (or uses the generic memory copy offload
DMA channel).

>   - These TLPs would be generated elsewhere in the PCIe fabric, e.g.,
>     by a driver on the host doing MMIO to the endpoint, or possibly
>     another endpoint doing peer-to-peer DMA.

MMIO or DMA, yes.

> 
>   - Mem read/write TLPs are routed by address, and the endpoint
>     accepts them if the address matches one of its BARs.

Yes, but pci_epc_mem_map() is not for use for BARs on the EP side. BAR setup on
EP PCI controllers uses inbound ATU windows. pci_epc_mem_map() programs outbound
ATU windows. BARs are set up with pci_epf_alloc_space() + pci_epc_set_bar().

> 
>   - This is a little different from a Root Complex, which would
>     basically treat reads/writes to anything outside the Root Port
>     windows as incoming DMA headed to physical memory.
> 
>   - A Root Complex would use the TLP address (the PCI bus address)
>     directly as a CPU physical memory address unless the TLP address
>     is translated by an IOMMU.
> 
>   - For the endpoint, you want the BAR to be an aperture to physical
>     memory in the address space of the SoC driving the endpoint.

Yes, and as said above this is done with pci_epf_alloc_space() +
pci_epc_set_bar(), not pci_epc_mem_map().

>   - The SoC physical memory address may be different from the PCI but
>     address in the TLP, and pci_epc_mem_map() is the way to account
>     for this?

pci_epc_mem_map() is essentially a call to pci_epc_mem_alloc_addr() +
pci_epc_map_addr(). pci_epc_mem_alloc_addr() allocates memory from the EP PCI
controller outbound windows and pci_epc_map_addr() programs an EP PCI controller
outbound ATU entry to map that memory to some PCI address region of the host (RC).

pci_epc_mem_map() was created because the current epc API does not hide any of
the EP PCI controller PCI address alignment constraints for programming outbound
ATU entries. This means that when using pci_epc_mem_alloc_addr() +
pci_epc_map_addr() directly, things may or may not work (often the latter)
depending on
the PCI address region (start address and its size) that is to be mapped and
accessed by the endpoint function driver.

Most EP PCI controllers at least require a PCI address to be aligned to some
fixed boundary (e.g. 64K for the RK3588 SoC on the Rock5B board). Even such
alignment boundary/mask is not exposed through the API (epc_features->align is
for BARs and should not be used for outbound ATU programming). Worse yet, some
EP PCI controllers use the lower bits of a local memory window address range as
the lower bits of the RC PCI address range when generating TLPs (e.g. RK3399 and
Cadence EP controller). For these EP PCI controllers, the programming of
outbound ATU entries for mapping an RC PCI address region thus ends up depending on
the start PCI address of the region *and* the region size (as the size drives
the number of lower bits of address that will change over the entire region).

The above API issues were essentially unsolvable from a regular endpoint driver
correctly coded to be EP controller independent. I could not get my prototype
nvme endpoint function driver to work without introducing pci_epc_mem_map()
(nvme does not require any special alignment of command buffers, so on an nvme
EPF driver we end up having to deal with essentially random PCI addresses that
have no special alignment).

pci_epc_mem_map() relies on the new ->align_addr() EP controller operation to
get information about the start address and the size to use for mapping a PCI
address region to local memory. The size given is used for pci_epc_mem_alloc_addr()
and the start address and size given are used with pci_epc_map_addr(). This
generates a correct mapping regardless of the PCI address and size given.
And for the actual data access by the EP function driver, pci_epc_mem_map()
gives the address within the mapping that corresponds to the PCI address that we
wanted mapped.

Et voila, problem solved :)
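
As a purely hypothetical illustration (the real example is the DWC
implementation added by patch 6), an .align_addr() operation for a controller
that only needs a fixed 64K aligned PCI address could look like:

#include <linux/align.h>
#include <linux/pci-epc.h>
#include <linux/sizes.h>

/*
 * Hypothetical controller operation: round the PCI address down to a
 * fixed 64K boundary, grow the mapping size so that it still covers the
 * requested range, and report through *offset where the caller's PCI
 * address sits within the aligned mapping.
 */
static u64 example_epc_align_addr(struct pci_epc *epc, u64 pci_addr,
                                  size_t *pci_size, size_t *offset)
{
        size_t ofst = pci_addr & (SZ_64K - 1);

        *offset = ofst;
        *pci_size = ALIGN(ofst + *pci_size, SZ_64K);

        return pci_addr - ofst;
}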

> 
>   - IOMMU translations from PCI to CPU physical address space are
>     pretty arbitrary and needn't be contiguous on the CPU side.
> 
>   - pci_epc_mem_map() sets up a conceptually similar PCI to CPU
>     address space translation, but it's much simpler because it
>     basically applies just a constant offset?
Damien Le Moal Oct. 22, 2024, 10:12 p.m. UTC | #13
On 10/23/24 00:30, Manivannan Sadhasivam wrote:
> On Tue, Oct 22, 2024 at 04:16:24PM +0200, Niklas Cassel wrote:
>> On Tue, Oct 22, 2024 at 07:26:31PM +0530, Manivannan Sadhasivam wrote:
>>> On Tue, Oct 22, 2024 at 10:38:58AM +0200, Niklas Cassel wrote:
>>>> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
>>>>> On 10/22/24 07:19, Bjorn Helgaas wrote:
>>>>>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>>>
>>>> However, if you think about a generic DMA controller, e.g. an ARM primecell
>>>> pl330, I don't see how it that DMA controller will be able to perform
>>>> transfers correctly if there is not an iATU mapping for the region that it
>>>> is reading/writing to.
>>>>
>>>
>>> I don't think the generic DMA controller can be used to read/write to remote
>>> memory. It needs to be integrated with the PCIe IP so that it can issue PCIe
>>> transactions.
>>
>> I'm not an expert, so I might of course be misunderstanding how things work.
>>
> 
> Neither am I :) I'm just sharing my understanding based on reading the DWC spec
> and open to get corrected if I'm wrong.
> 
>> When the CPU performs an AXI read/write to a MMIO address within the PCIe
>> controller (specifically the PCIe controller's outbound memory window),
>> the PCIe controller translates that incoming read/write to a read/write on the
>> PCIe bus.
>>
> 
> I don't think the *PCIe controller* translates the read/writes, but the iATU. If
> we use iATU, then the remote address needs to be mapped to the endpoint DDR and
> if CPU performs AXI read/write to that address, then iATU will translate the DDR
> address to remote address and then issue PCIe transactions (together with the
> PCIe controller).
> 
> And if DMA is used, then DMA controller can issue PCIe transactions to the
> remote memory itself (again, together with the PCIe controller). So no mapping
> is required here.

Caveat: DMA here cannot be the "using the generic memory copy offload DMA
channel" case. It must be DMA driven by hardware rx/tx DMA channels. For the
memory copy offload DMA channel, an EPF must map the address range to "DMA" to
(which is really memcpy_toio/fromio(), but the DMA API will hide that...).

Would love to have some EPF wrapper API around all that mess to simplify EPF
drivers. They all end up having to do the exact same things.

> 
> - Mani
>
Bjorn Helgaas Oct. 22, 2024, 11:49 p.m. UTC | #14
On Wed, Oct 23, 2024 at 07:05:54AM +0900, Damien Le Moal wrote:
> On 10/23/24 05:47, Bjorn Helgaas wrote:
> > On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
> >> On 10/22/24 07:19, Bjorn Helgaas wrote:
> >>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
> >>>> This series introduces the new functions pci_epc_mem_map() and
> >>>> pci_epc_mem_unmap() to improve handling of the PCI address mapping
> >>>> alignment constraints of endpoint controllers in a controller
> >>>> independent manner.
> >>>
> >>> Hi Damien, I know this is obvious to everybody except me, but who
> >>> uses this mapping?  A driver running on the endpoint that does
> >>> MMIO?  DMA (Memory Reads or Writes that target an endpoint BAR)?
> >>> I'm still trying to wrap my head around the whole endpoint driver
> >>> model.
> >>
> >> The mapping API is for mmio or DMA using endpoint controller memory
> >> mapped to a host PCI address range. It is not for BARs. BARs setup
> >> does not use the same API and has not changed with these patches.
> >>
> >> BARs can still be accessed on the endpoint (within the EPF driver)
> >> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any
> >> data transfer between the PCI RC host and the EPF driver on the EP
> >> host that use mmio or DMA generic channel (memory copy offload
> >> channel) needs the new mapping API. DMA transfers that can be done
> >> using dedicated DMA rx/tx channels associated with the endpoint
> >> controller do not need to use this API as the mapping to the host
> >> PCI address space is automatically handled by the DMA driver.
> > 
> > Sorry I'm dense.  I'm really not used to thinking in the endpoint
> > point of view.  Correct me where I go off the rails:
> > 
> >   - This code (pci_epc_mem_map()) runs on an endpoint, basically as
> >     part of some endpoint firmware, right?
> 
> Not sure what you mean by "firmware" here. pci_epc_mem_map() is
> intended for use by a PCI endpoint function driver on the EP side,
> nothing else.

I mean the software operating the endpoint from the "inside", not from
the PCIe side.

In a non-endpoint framework scenario, the kernel running on the host
enumerates PCI devices using config reads, and a driver like nvme runs
on the host and operates an NVMe device with config accesses and MMIO
to the NVMe BARs.

I referred to "firmware" because if the endpoint were on a PCIe
plug-in card, an SoC on the card could run firmware that uses the
endpoint framework to respond to PCIe activity generated by the host.
(Of course, if the host driver programs the endpoint appropriately,
the endpoint might also initiate its own PCIe activity for DMA.)

> >   - On the endpoint's PCIe side, it receives memory read/write TLPs?
> 
> pci_epc_mem_map() does not receive anything itself. It only
> allocates local EP controller memory from outbound ATU windows and
> map that to a PCI address region of the host (RC side). Here "map"
> means programming the ATU using a correctly aligned PCI address that
> satisfies the EP PCI controller alignment constraints.

I assumed we were talking about an endpoint responding to PCIe traffic
generated elsewhere, but I think my assumption was wrong.

I don't have any specs for these endpoint controllers, and there's
nothing in Documentation/PCI/endpoint/ about ATUs.  Some doc and a few
pictures would go a long ways.

> The memory read/write TLPs will happen once the EP function driver
> access the mapped memory with memcpy_fromio/toio() (or use generic
> memory copy offload DMA channel).

This would be the endpoint itself *initiating* DMA TLPs on PCIe, not
responding to incoming DMA, I guess.

> >   - These TLPs would be generated elsewhere in the PCIe fabric, e.g.,
> >     by a driver on the host doing MMIO to the endpoint, or possibly
> >     another endpoint doing peer-to-peer DMA.
> 
> MMIO or DMA, yes.
> 
> >   - Mem read/write TLPs are routed by address, and the endpoint
> >     accepts them if the address matches one of its BARs.
> 
> Yes, but pci_epc_mem_map() is not for use for BARs on the EP side.
> BARs setup on EP PCI controllers use inbound ATU windows.
> pci_epc_mem_map() programs outbound ATU windows. BARs setup use
> pci_epf_alloc_space() + pci_epc_set_bar() to setup a BAR.
>
> >   - This is a little different from a Root Complex, which would
> >     basically treat reads/writes to anything outside the Root Port
> >     windows as incoming DMA headed to physical memory.
> > 
> >   - A Root Complex would use the TLP address (the PCI bus address)
> >     directly as a CPU physical memory address unless the TLP address
> >     is translated by an IOMMU.
> > 
> >   - For the endpoint, you want the BAR to be an aperture to physical
> >     memory in the address space of the SoC driving the endpoint.
> 
> Yes, and as said above this is done with pci_epf_alloc_space() +
> pci_epc_set_bar(), not pci_epc_mem_map().

And if the endpoint is *initiating* DMA, BARs of that endpoint are not
involved at all.  The target of the DMA would be either host memory or
a BAR of another endpoint.

> >   - The SoC physical memory address may be different from the PCI but
> >     address in the TLP, and pci_epc_mem_map() is the way to account
> >     for this?
> 
> pci_epc_mem_map() is essentially a call to pci_epc_mem_alloc_addr()
> + pci_epc_map_addr(). pci_epc_mem_alloc_addr() allocates memory from
> the EP PCI controller outbound windows and pci_epc_map_addr()
> programs an EP PCI controller outbound ATU entry to map that memory
> to some PCI address region of the host (RC).

I see "RC" and "RC address" used often in this area, but I don't quite
understand the RC connection.  It sounds like this might be the PCI
memory address space, i.e., the set of addresses that might appear in
PCIe TLPs?

Does "EPC addr space" refer to the physical address space of the
processor operating the endpoint, or does it mean something more
specific?

> pci_epc_mem_map() was created because the current epc API does not
> hide any of the EP PCI controller PCI address alignment constraints
> for programming outbound ATU entries. This means that using directly
> pci_epc_mem_alloc_addr() + pci_epc_map_addr(), things may or may not
> work (often the latter) depending on the PCI address region (start
> address and its size) that is to be mapped and accessed by the
> endpoint function driver.
> 
> Most EP PCI controllers at least require a PCI address to be aligned
> to some fixed boundary (e.g. 64K for the RK3588 SoC on the Rock5B
> board). Even such alignment boundary/mask is not exposed through the
> API (epc_features->align is for BARs and should not be used for
> outbound ATU programming). Worse yet, some EP PCI controllers use
> the lower bits of a local memory window address range as the lower
> bits of the RC PCI address range when generating TLPs (e.g. RK3399
> and Cadence EP controller). For these EP PCI controllers, the
> programming of outbound ATU entries for mapping a RC PCI address
> region thus endup depending on the start PCI address of the region
> *and* the region size (as the size drives the number of lower bits
> of address that will change over the entire region).
> 
> The above API issues where essentially unsolvable from a regular
> endpoint driver correctly coded to be EP controller independent. I
> could not get my prototype nvme endpoint function driver to work
> without introducing pci_epc_mem_map() (nvme does not require any
> special alignment of command buffers, so on an nvme EPF driver we
> end up having to deal with essentially random PCI addresses that
> have no special alignment).
> 
> pci_epc_mem_map() relies on the new ->align_addr() EP controller
> operation to get information about the start address and the size to
> use for mapping to local memory a PCI address region. The size given
> is used for pci_epc_mem_alloc_addr() and the start address and size
> given are used with pci_epc_map_addr(). This generates a correct
> mapping regardless of the PCI address and size given.  And for the
> actual data access by the EP function driver, pci_epc_mem_map()
> gives the address within the mapping that correspond to the PCI
> address that we wanted mapped.

> >   - IOMMU translations from PCI to CPU physical address space are
> >     pretty arbitrary and needn't be contiguous on the CPU side.
> > 
> >   - pci_epc_mem_map() sets up a conceptually similar PCI to CPU
> >     address space translation, but it's much simpler because it
> >     basically applies just a constant offset?
Damien Le Moal Oct. 23, 2024, 2:51 a.m. UTC | #15
On 10/23/24 08:49, Bjorn Helgaas wrote:
> On Wed, Oct 23, 2024 at 07:05:54AM +0900, Damien Le Moal wrote:
>> On 10/23/24 05:47, Bjorn Helgaas wrote:
>>> On Tue, Oct 22, 2024 at 10:51:53AM +0900, Damien Le Moal wrote:
>>>> On 10/22/24 07:19, Bjorn Helgaas wrote:
>>>>> On Sat, Oct 12, 2024 at 08:32:40PM +0900, Damien Le Moal wrote:
>>>>>> This series introduces the new functions pci_epc_mem_map() and
>>>>>> pci_epc_mem_unmap() to improve handling of the PCI address mapping
>>>>>> alignment constraints of endpoint controllers in a controller
>>>>>> independent manner.
>>>>>
>>>>> Hi Damien, I know this is obvious to everybody except me, but who
>>>>> uses this mapping?  A driver running on the endpoint that does
>>>>> MMIO?  DMA (Memory Reads or Writes that target an endpoint BAR)?
>>>>> I'm still trying to wrap my head around the whole endpoint driver
>>>>> model.
>>>>
>>>> The mapping API is for mmio or DMA using endpoint controller memory
>>>> mapped to a host PCI address range. It is not for BARs. BARs setup
>>>> does not use the same API and has not changed with these patches.
>>>>
>>>> BARs can still be accessed on the endpoint (within the EPF driver)
>>>> with regular READ_ONCE()/WRITE_ONCE() once they are set. But any
>>>> data transfer between the PCI RC host and the EPF driver on the EP
>>>> host that use mmio or DMA generic channel (memory copy offload
>>>> channel) needs the new mapping API. DMA transfers that can be done
>>>> using dedicated DMA rx/tx channels associated with the endpoint
>>>> controller do not need to use this API as the mapping to the host
>>>> PCI address space is automatically handled by the DMA driver.
>>>
>>> Sorry I'm dense.  I'm really not used to thinking in the endpoint
>>> point of view.  Correct me where I go off the rails:
>>>
>>>   - This code (pci_epc_mem_map()) runs on an endpoint, basically as
>>>     part of some endpoint firmware, right?
>>
>> Not sure what you mean by "firmware" here. pci_epc_mem_map() is
>> intended for use by a PCI endpoint function driver on the EP side,
>> nothing else.
> 
> I mean the software operating the endpoint from the "inside", not from
> the PCIe side.

Got it. So yes, a real device firmware, or a PCI endpoint function driver for a
machine running the PCI endpoint framework.

> In a non-endpoint framework scenario, the kernel running on the host
> enumerates PCI devices using config reads, and a driver like nvme runs
> on the host and operates an NVMe device with config accesses and MMIO
> to the NVMe BARs.

Yep, I know that.

>> pci_epc_mem_map() does not receive anything itself. It only
>> allocates local EP controller memory from outbound ATU windows and
>> map that to a PCI address region of the host (RC side). Here "map"
>> means programming the ATU using a correctly aligned PCI address that
>> satisfies the EP PCI controller alignment constraints.
> 
> I assumed we were talking about an endpoint responding to PCIe traffic
> generated elsewhere, but I think my assumption was wrong.

Not at all. See below.

> I don't have any specs for these endpoint controllers, and there's
> nothing in Documentation/PCI/endpoint/ about ATUs.  Some doc and a few
> pictures would go a long ways.

Agree. The PCI endpoint API documentation is far from great and should be
improved. Will try to do something about it.

>> The memory read/write TLPs will happen once the EP function driver
>> access the mapped memory with memcpy_fromio/toio() (or use generic
>> memory copy offload DMA channel).
> 
> This would be the endpoint itself *initiating* DMA TLPs on PCIe, not
> responding to incoming DMA, I guess.

It can be both. What differs is how the endpoint gets the PCI address to read/write.

For an endpoint initiated transfer, typically, the PCI address is obtained from
some "command" received through a BAR or through DMA. The command has the PCI
addresses to use for transferring data to/from the host (e.g. an nvme rw command
uses PRPs or SGLs to specify the PCI address segments for the data buffer of a
command). For this case, the EPF driver calls pci_epc_mem_map() for the
command buffer, does the transfer (memcpy_toio/fromio()) and unmaps with
pci_epc_mem_unmap(). Note though that here, if an eDMA channel is used for the
transfer, the DMA engine will do the mapping automatically and the epf does not
need to call pci_epc_mem_map()/pci_epc_mem_unmap(). There is still an issue in
this area which is that it is *not* clear if the DMA channel used can actually
do the mapping automatically or not. E.g. the generic DMA channel (mem copy
offload engine) will not. So there is still some API improvement needed to
abstract more HW dependent things here.

For an RC initiated transfer, the transfers are not done to random addresses.
E.g. using NVMe CMB (controller memory buffer), the host will tell the endpoint
"I want to use the CMB with this PCI address". Then the endpoint can use
pci_epc_mem_map() to map the CMB to said PCI address and exchange data with the
host through it.

>>>   - These TLPs would be generated elsewhere in the PCIe fabric, e.g.,
>>>     by a driver on the host doing MMIO to the endpoint, or possibly
>>>     another endpoint doing peer-to-peer DMA.
>>
>> MMIO or DMA, yes.
>>
>>>   - Mem read/write TLPs are routed by address, and the endpoint
>>>     accepts them if the address matches one of its BARs.
>>
>> Yes, but pci_epc_mem_map() is not for use for BARs on the EP side.
>> BARs setup on EP PCI controllers use inbound ATU windows.
>> pci_epc_mem_map() programs outbound ATU windows. BARs setup use
>> pci_epf_alloc_space() + pci_epc_set_bar() to setup a BAR.
>>
>>>   - This is a little different from a Root Complex, which would
>>>     basically treat reads/writes to anything outside the Root Port
>>>     windows as incoming DMA headed to physical memory.
>>>
>>>   - A Root Complex would use the TLP address (the PCI bus address)
>>>     directly as a CPU physical memory address unless the TLP address
>>>     is translated by an IOMMU.
>>>
>>>   - For the endpoint, you want the BAR to be an aperture to physical
>>>     memory in the address space of the SoC driving the endpoint.
>>
>> Yes, and as said above this is done with pci_epf_alloc_space() +
>> pci_epc_set_bar(), not pci_epc_mem_map().
> 
> And if the endpoint is *initiating* DMA, BARs of that endpoint are not
> involved at all.  The target of the DMA would be either host memory or
> a BAR of another endpoint.

Yes.

>>>   - The SoC physical memory address may be different from the PCI but
>>>     address in the TLP, and pci_epc_mem_map() is the way to account
>>>     for this?
>>
>> pci_epc_mem_map() is essentially a call to pci_epc_mem_alloc_addr()
>> + pci_epc_map_addr(). pci_epc_mem_alloc_addr() allocates memory from
>> the EP PCI controller outbound windows and pci_epc_map_addr()
>> programs an EP PCI controller outbound ATU entry to map that memory
>> to some PCI address region of the host (RC).
> 
> I see "RC" and "RC address" used often in this area, but I don't quite
> understand the RC connection.  It sounds like this might be the PCI
> memory address space, i.e., the set of addresses that might appear in
> PCIe TLPs?

Yes. Sorry about that. PCI address == RC address in my head since the endpoint
does not "own" the PCI address space, the host RC does.

> Does "EPC addr space" refer to the physical address space of the
> processor operating the endpoint, or does it mean something more
> specific?

Correct. EPC address is the local memory address on the endpoint, so CPU
addresses. For something that maps to an RC PCI address, that generally means
addresses within the EP PCI controller outbound memory regions.
Niklas Cassel Oct. 23, 2024, 9:29 a.m. UTC | #16
On Wed, Oct 23, 2024 at 11:51:41AM +0900, Damien Le Moal wrote:

(snip)

> For an endpoint initiated transfer, typically, the PCI address is obtained from
> some "command" received through a BAR or through DMA. The command has the PCI
> addresses to use for transfering data to/from the host (e.g. an nvme rw command
> uses PRPs or SGLs to specify the PCI address segments for the data buffer of a
> command). For this case, the EPF driver calls calls pci_epc_mem_map() for the
> command buffer, does the transfer (memcpy_toio/fromio()) and unmaps with
> pci_epc_mem_unmap(). Note though that here, if an eDMA channel is used for the
> transfer, the DMA engine will do the mapping automatically and the epf does not
> need to call pci_epc_mem_map()/pci_epc_mem_unmap(). There is still an issue in
> this area which is that it is *not* clear if the DMA channel used can actually
> do the mapping automatically or not. E.g. the generic DMA channel (mem copy
> offload engine) will not. So there is still some API improvement needed to
> abstract more HW dependent things here.

FWIW, in my final reply here:
https://lore.kernel.org/lkml/ZiYuIaX7ZV0exKMt@ryzen/

"
I did suggest that DWC-based drivers could set a DMA_SLAVE_SKIP_MEM_MAP flag
or similar when registering the eDMA, which pci-epf-test then could check,
but I got no response if anyone else thought that this was a good idea.
"


For DMA_SLAVE (private tx/rx DMA channels):
For DWC-based controllers, we can definitely set DMA_SLAVE_SKIP_MEM_MAP when
registering the eDMA (e.g. in dw_pcie_edma_detect()).

However, I don't know how the DMA hardware (if any) in:
drivers/pci/controller/cadence/pcie-cadence-ep.c
drivers/pci/controller/pcie-rcar-ep.c
drivers/pci/controller/pcie-rockchip-ep.c
works, so I'm not sure if those drivers can set DMA_SLAVE_SKIP_MEM_MAP
(the safest thing is to not set that flag, until we know how they work).


For DMA_MEMCPY:
We know that we need to perform pci_epc_mem_map() when using
DMA API + "dummy" memcpy dma-channel (DMA_MEMCPY).
I could imagine that some embedded DMA controllers also provide
DMA_MEMCPY capabilities (in addition to DMA_SLAVE).


So I guess the safest thing is to call the flag something like:
PCI_EPC_DMA_SKIP_MEM_MAP
(rather than PCI_EPC_DMA_SLAVE_SKIP_MEM_MAP).
(since the embedded DMA controller might provide both DMA_SLAVE and DMA_MEMCPY).

And let the EPC driver (e.g. dw_pcie_edma_detect()), or possibly the DMA
driver itself to provide/set this flag.


Kind regards,
Niklas