mbox series

[RFC,rdma-next,00/10] RDMA/rxe: Add RDMA FLUSH operation

Message ID 20211228080717.10666-1-lizhijian@cn.fujitsu.com (mailing list archive)
Headers show
Series RDMA/rxe: Add RDMA FLUSH operation | expand

Message

Li Zhijian Dec. 28, 2021, 8:07 a.m. UTC
Hey folks,

These patches are going to implement a *NEW* RDMA opcode "RDMA FLUSH".
In IB SPEC 1.5[1][2], 2 new opcodes, ATOMIC WRITE and RDMA FLUSH were
added in the MEMORY PLACEMENT EXTENSIONS section.

FLUSH is used by the requesting node to achieve guarantees on the data
placement within the memory subsystem of preceding accesses to a
single memory region, such as those performed by RDMA WRITE, Atomics
and ATOMIC WRITE requests.

The operation indicates the virtual address space of a destination node
and where the guarantees should apply. This range must be contiguous
in the virtual space of the memory key but it is not necessarily a
contiguous range of physical memory.

FLUSH packets carry FLUSH extended transport header (see below) to
specify the placement type and the selectivity level of the operation
and RDMA extended header (RETH, see base document RETH definition) to
specify the R_Key VA and Length associated with this request following
the BTH in RC, RDETH in RD and XRCETH in XRC.

RC FLUSH:
+----+------+------+
|BTH | FETH | RETH |
+----+------+------+

RD FLUSH:
+----+------+------+------+
|BTH | RDETH| FETH | RETH |
+----+------+------+------+

XRC FLUSH:
+----+-------+------+------+
|BTH | XRCETH| FETH | RETH |
+----+-------+------+------+

Currently, we introduce RC and RD services only, since XRC has not been
implemented by rxe yet.
NOTE: only RC service is tested now, and since other HCAs have not
added/implemented FLUSH yet, we can only test FLUSH operation in both
SoftRoCE/rxe devices.

The corresponding rdma-core and FLUSH example are available on:
https://github.com/zhijianli88/rdma-core/tree/rfc

Below list some details about FLUSH transport packet:

A FLUSH message is built upon FLUSH request packet and is responded
successfully by RDMA READ response of zero size.

oA19-2: FLUSH shall be single packet message and shall have no payload.

oA19-2: FLUSH shall be single packet message and shall have no payload.
oA19-5: FLUSH BTH shall hold the Opcode = 0x1C

FLUSH Extended Transport Header(FETH)
+-----+-----------+------------------------+----------------------+
|Bits |   31-6    |          5-4           |        3-0           |
+-----+-----------+------------------------+----------------------+
|     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+-----+-----------+------------------------+----------------------+

Selectivity Level (SEL) – defines the memory region scope the FLUSH
should apply on. Values are as follows:
• b’00 - Memory Region Range: FLUSH applies for all preceding memory
         updates to the RETH range on this QP. All RETH fields shall be
         valid in this selectivity mode. RETH:DMALen field shall be
         between zero and (2 31 -1) bytes (inclusive).
• b’01 - Memory Region: FLUSH applies for all preceding memory up-
         dates to RETH.R_key on this QP. RETH:DMALen and RETH:VA
         shall be ignored in this mode.
• b'10 - Reserved.
• b'11 - Reserved.

Placement Type (PLT) – Defines the memory placement guarantee of
this FLUSH. Multiple bits may be set in this field. Values are as follows:
• Bit 0 if set to '1' indicated that the FLUSH should guarantee Global
  Visibility.
• Bit 1 if set to '1' indicated that the FLUSH should guarantee
  Persistence.
• Bits 3:2 are reserved

[1]: https://www.infinibandta.org/ibta-specification/ # login required
[2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx

CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Zhu Yanjun <zyjzyj2000@gmail.com
CC: Leon Romanovsky <leon@kernel.org>
CC: Bob Pearson <rpearsonhpe@gmail.com>
CC: Weihang Li <liweihang@huawei.com>
CC: Mark Bloch <mbloch@nvidia.com>
CC: Wenpeng Liang <liangwenpeng@huawei.com>
CC: Aharon Landau <aharonl@nvidia.com>
CC: linux-rdma@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Li Zhijian (10):
  RDMA: mr: Introduce is_pmem
  RDMA: Allow registering MR with flush access flags
  RDMA/rxe: Allow registering FLUSH flags for supported device only
  RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
  RDMA/rxe: Allow registering persistent flag for pmem MR only
  RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  RDMA/rxe: Set BTH's SE to zero for FLUSH packet
  RDMA/rxe: Implement flush execution in responder side
  RDMA/rxe: Implement flush completion
  RDMA/rxe: Add RD FLUSH service support

 drivers/infiniband/core/uverbs_cmd.c    |  16 +++
 drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
 drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 ++++++++++
 drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c      |  63 +++++++++++-
 drivers/infiniband/sw/rxe/rxe_opcode.c  |  33 ++++++
 drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
 drivers/infiniband/sw/rxe/rxe_param.h   |   3 +-
 drivers/infiniband/sw/rxe/rxe_req.c     |  14 ++-
 drivers/infiniband/sw/rxe/rxe_resp.c    | 131 +++++++++++++++++++++++-
 include/rdma/ib_pack.h                  |   3 +
 include/rdma/ib_verbs.h                 |  22 +++-
 include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
 include/uapi/rdma/ib_user_verbs.h       |  18 ++++
 include/uapi/rdma/rdma_user_rxe.h       |   6 ++
 15 files changed, 362 insertions(+), 10 deletions(-)

Comments

Gromadzki, Tomasz Dec. 29, 2021, 8:49 a.m. UTC | #1
1)
> rdma_post_flush(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl,
>		int nsge, int flags, uint32_t type, uint32_t level,
>		uint64_t remote_addr, uint32_t rkey)
Shall we not use struct ibv_sge *sgl in this API call but explicitly use flush length (DMALen) argument?
It will be similar to other rdma_post_XXX API calls where ibv_sge is not used at all.
Ibv_sge is used only in rdma_post_XXXv API calls.

2)
>struct ibv_send_wr {
>...
> 	union {
>		struct {
>			uint64_t	remote_addr;
>			uint32_t	rkey;
>			uint32_t	type;
>			uint32_t	level;
>		} flush;

Shall we extend this structure with 
uint32_t length
and abandon using *sg_list as it is related in RDMA verbs to local memory access only:

Ibv_post_send.3:
struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
int                     num_sge;                /* Size of the s/g array */

In the case of flush, there is no local context at all.

Thanks
Tomasz

> -----Original Message-----
> From: Li Zhijian <lizhijian@cn.fujitsu.com>
> Sent: Tuesday, December 28, 2021 9:07 AM
> To: linux-rdma@vger.kernel.org; zyjzyj2000@gmail.com; jgg@ziepe.ca;
> aharonl@nvidia.com; leon@kernel.org
> Cc: linux-kernel@vger.kernel.org; mbloch@nvidia.com;
> liweihang@huawei.com; liangwenpeng@huawei.com;
> yangx.jy@cn.fujitsu.com; rpearsonhpe@gmail.com; y-goto@fujitsu.com; Li
> Zhijian <lizhijian@cn.fujitsu.com>
> Subject: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH
> operation
> 
> Hey folks,
> 
> These patches are going to implement a *NEW* RDMA opcode "RDMA
> FLUSH".
> In IB SPEC 1.5[1][2], 2 new opcodes, ATOMIC WRITE and RDMA FLUSH were
> added in the MEMORY PLACEMENT EXTENSIONS section.
> 
> FLUSH is used by the requesting node to achieve guarantees on the data
> placement within the memory subsystem of preceding accesses to a single
> memory region, such as those performed by RDMA WRITE, Atomics and
> ATOMIC WRITE requests.
> 
> The operation indicates the virtual address space of a destination node and
> where the guarantees should apply. This range must be contiguous in the
> virtual space of the memory key but it is not necessarily a contiguous range
> of physical memory.
> 
> FLUSH packets carry FLUSH extended transport header (see below) to
> specify the placement type and the selectivity level of the operation and
> RDMA extended header (RETH, see base document RETH definition) to
> specify the R_Key VA and Length associated with this request following the
> BTH in RC, RDETH in RD and XRCETH in XRC.
> 
> RC FLUSH:
> +----+------+------+
> |BTH | FETH | RETH |
> +----+------+------+
> 
> RD FLUSH:
> +----+------+------+------+
> |BTH | RDETH| FETH | RETH |
> +----+------+------+------+
> 
> XRC FLUSH:
> +----+-------+------+------+
> |BTH | XRCETH| FETH | RETH |
> +----+-------+------+------+
> 
> Currently, we introduce RC and RD services only, since XRC has not been
> implemented by rxe yet.
> NOTE: only RC service is tested now, and since other HCAs have not
> added/implemented FLUSH yet, we can only test FLUSH operation in both
> SoftRoCE/rxe devices.
> 
> The corresponding rdma-core and FLUSH example are available on:
> https://github.com/zhijianli88/rdma-core/tree/rfc
> 
> Below list some details about FLUSH transport packet:
> 
> A FLUSH message is built upon FLUSH request packet and is responded
> successfully by RDMA READ response of zero size.
> 
> oA19-2: FLUSH shall be single packet message and shall have no payload.
> 
> oA19-2: FLUSH shall be single packet message and shall have no payload.
> oA19-5: FLUSH BTH shall hold the Opcode = 0x1C
> 
> FLUSH Extended Transport Header(FETH)
> +-----+-----------+------------------------+----------------------+
> |Bits |   31-6    |          5-4           |        3-0           |
> +-----+-----------+------------------------+----------------------+
> |     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
> +-----+-----------+------------------------+----------------------+
> 
> Selectivity Level (SEL) – defines the memory region scope the FLUSH should
> apply on. Values are as follows:
> • b’00 - Memory Region Range: FLUSH applies for all preceding memory
>          updates to the RETH range on this QP. All RETH fields shall be
>          valid in this selectivity mode. RETH:DMALen field shall be
>          between zero and (2 31 -1) bytes (inclusive).
> • b’01 - Memory Region: FLUSH applies for all preceding memory up-
>          dates to RETH.R_key on this QP. RETH:DMALen and RETH:VA
>          shall be ignored in this mode.
> • b'10 - Reserved.
> • b'11 - Reserved.
> 
> Placement Type (PLT) – Defines the memory placement guarantee of this
> FLUSH. Multiple bits may be set in this field. Values are as follows:
> • Bit 0 if set to '1' indicated that the FLUSH should guarantee Global
>   Visibility.
> • Bit 1 if set to '1' indicated that the FLUSH should guarantee
>   Persistence.
> • Bits 3:2 are reserved
> 
> [1]: https://www.infinibandta.org/ibta-specification/ # login required
> [2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-
> Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx
> 
> CC: Jason Gunthorpe <jgg@ziepe.ca>
> CC: Zhu Yanjun <zyjzyj2000@gmail.com
> CC: Leon Romanovsky <leon@kernel.org>
> CC: Bob Pearson <rpearsonhpe@gmail.com>
> CC: Weihang Li <liweihang@huawei.com>
> CC: Mark Bloch <mbloch@nvidia.com>
> CC: Wenpeng Liang <liangwenpeng@huawei.com>
> CC: Aharon Landau <aharonl@nvidia.com>
> CC: linux-rdma@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> 
> Li Zhijian (10):
>   RDMA: mr: Introduce is_pmem
>   RDMA: Allow registering MR with flush access flags
>   RDMA/rxe: Allow registering FLUSH flags for supported device only
>   RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
>   RDMA/rxe: Allow registering persistent flag for pmem MR only
>   RDMA/rxe: Implement RC RDMA FLUSH service in requester side
>   RDMA/rxe: Set BTH's SE to zero for FLUSH packet
>   RDMA/rxe: Implement flush execution in responder side
>   RDMA/rxe: Implement flush completion
>   RDMA/rxe: Add RD FLUSH service support
> 
>  drivers/infiniband/core/uverbs_cmd.c    |  16 +++
>  drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
>  drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 ++++++++++
>  drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
>  drivers/infiniband/sw/rxe/rxe_mr.c      |  63 +++++++++++-
>  drivers/infiniband/sw/rxe/rxe_opcode.c  |  33 ++++++
>  drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
>  drivers/infiniband/sw/rxe/rxe_param.h   |   3 +-
>  drivers/infiniband/sw/rxe/rxe_req.c     |  14 ++-
>  drivers/infiniband/sw/rxe/rxe_resp.c    | 131 +++++++++++++++++++++++-
>  include/rdma/ib_pack.h                  |   3 +
>  include/rdma/ib_verbs.h                 |  22 +++-
>  include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
>  include/uapi/rdma/ib_user_verbs.h       |  18 ++++
>  include/uapi/rdma/rdma_user_rxe.h       |   6 ++
>  15 files changed, 362 insertions(+), 10 deletions(-)
> 
> --
> 2.31.1
> 
> 

---------------------------------------------------------------------
Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.
Tom Talpey Dec. 29, 2021, 2:35 p.m. UTC | #2
On 12/29/2021 3:49 AM, Gromadzki, Tomasz wrote:
> 1)
>> rdma_post_flush(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl,
>> 		int nsge, int flags, uint32_t type, uint32_t level,
>> 		uint64_t remote_addr, uint32_t rkey)
> Shall we not use struct ibv_sge *sgl in this API call but explicitly use flush length (DMALen) argument?
> It will be similar to other rdma_post_XXX API calls where ibv_sge is not used at all.
> Ibv_sge is used only in rdma_post_XXXv API calls.

I agree. The Flush operation does not access any local memory and
therefore an SGE is inappropriate. Better to pass the actual
Flush parameters, i.e. add the length. See next comment.

> 2)
>> struct ibv_send_wr {
>> ...
>> 	union {
>> 		struct {
>> 			uint64_t	remote_addr;
>> 			uint32_t	rkey;
>> 			uint32_t	type;
>> 			uint32_t	level;
>> 		} flush;
> 
> Shall we extend this structure with
> uint32_t length

Also agree. The remote length of the subregion to be flushed is
a required parameter and should be passed here.

Is it really necessary to promote the type and level to full 32-bit
fields in the API? In the protocol, these are just 4 bits and 2 bits,
respectively, and are packed together into a singe 32-bit FETH field.

Tom.

> and abandon using *sg_list as it is related in RDMA verbs to local memory access only:
> 
> Ibv_post_send.3:
> struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
> int                     num_sge;                /* Size of the s/g array */
> 
> In the case of flush, there is no local context at all.
> 
> Thanks
> Tomasz
> 
>> -----Original Message-----
>> From: Li Zhijian <lizhijian@cn.fujitsu.com>
>> Sent: Tuesday, December 28, 2021 9:07 AM
>> To: linux-rdma@vger.kernel.org; zyjzyj2000@gmail.com; jgg@ziepe.ca;
>> aharonl@nvidia.com; leon@kernel.org
>> Cc: linux-kernel@vger.kernel.org; mbloch@nvidia.com;
>> liweihang@huawei.com; liangwenpeng@huawei.com;
>> yangx.jy@cn.fujitsu.com; rpearsonhpe@gmail.com; y-goto@fujitsu.com; Li
>> Zhijian <lizhijian@cn.fujitsu.com>
>> Subject: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH
>> operation
>>
>> Hey folks,
>>
>> These patches are going to implement a *NEW* RDMA opcode "RDMA
>> FLUSH".
>> In IB SPEC 1.5[1][2], 2 new opcodes, ATOMIC WRITE and RDMA FLUSH were
>> added in the MEMORY PLACEMENT EXTENSIONS section.
>>
>> FLUSH is used by the requesting node to achieve guarantees on the data
>> placement within the memory subsystem of preceding accesses to a single
>> memory region, such as those performed by RDMA WRITE, Atomics and
>> ATOMIC WRITE requests.
>>
>> The operation indicates the virtual address space of a destination node and
>> where the guarantees should apply. This range must be contiguous in the
>> virtual space of the memory key but it is not necessarily a contiguous range
>> of physical memory.
>>
>> FLUSH packets carry FLUSH extended transport header (see below) to
>> specify the placement type and the selectivity level of the operation and
>> RDMA extended header (RETH, see base document RETH definition) to
>> specify the R_Key VA and Length associated with this request following the
>> BTH in RC, RDETH in RD and XRCETH in XRC.
>>
>> RC FLUSH:
>> +----+------+------+
>> |BTH | FETH | RETH |
>> +----+------+------+
>>
>> RD FLUSH:
>> +----+------+------+------+
>> |BTH | RDETH| FETH | RETH |
>> +----+------+------+------+
>>
>> XRC FLUSH:
>> +----+-------+------+------+
>> |BTH | XRCETH| FETH | RETH |
>> +----+-------+------+------+
>>
>> Currently, we introduce RC and RD services only, since XRC has not been
>> implemented by rxe yet.
>> NOTE: only RC service is tested now, and since other HCAs have not
>> added/implemented FLUSH yet, we can only test FLUSH operation in both
>> SoftRoCE/rxe devices.
>>
>> The corresponding rdma-core and FLUSH example are available on:
>> https://github.com/zhijianli88/rdma-core/tree/rfc
>>
>> Below list some details about FLUSH transport packet:
>>
>> A FLUSH message is built upon FLUSH request packet and is responded
>> successfully by RDMA READ response of zero size.
>>
>> oA19-2: FLUSH shall be single packet message and shall have no payload.
>>
>> oA19-2: FLUSH shall be single packet message and shall have no payload.
>> oA19-5: FLUSH BTH shall hold the Opcode = 0x1C
>>
>> FLUSH Extended Transport Header(FETH)
>> +-----+-----------+------------------------+----------------------+
>> |Bits |   31-6    |          5-4           |        3-0           |
>> +-----+-----------+------------------------+----------------------+
>> |     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
>> +-----+-----------+------------------------+----------------------+
>>
>> Selectivity Level (SEL) – defines the memory region scope the FLUSH should
>> apply on. Values are as follows:
>> • b’00 - Memory Region Range: FLUSH applies for all preceding memory
>>           updates to the RETH range on this QP. All RETH fields shall be
>>           valid in this selectivity mode. RETH:DMALen field shall be
>>           between zero and (2 31 -1) bytes (inclusive).
>> • b’01 - Memory Region: FLUSH applies for all preceding memory up-
>>           dates to RETH.R_key on this QP. RETH:DMALen and RETH:VA
>>           shall be ignored in this mode.
>> • b'10 - Reserved.
>> • b'11 - Reserved.
>>
>> Placement Type (PLT) – Defines the memory placement guarantee of this
>> FLUSH. Multiple bits may be set in this field. Values are as follows:
>> • Bit 0 if set to '1' indicated that the FLUSH should guarantee Global
>>    Visibility.
>> • Bit 1 if set to '1' indicated that the FLUSH should guarantee
>>    Persistence.
>> • Bits 3:2 are reserved
>>
>> [1]: https://www.infinibandta.org/ibta-specification/ # login required
>> [2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-
>> Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx
>>
>> CC: Jason Gunthorpe <jgg@ziepe.ca>
>> CC: Zhu Yanjun <zyjzyj2000@gmail.com
>> CC: Leon Romanovsky <leon@kernel.org>
>> CC: Bob Pearson <rpearsonhpe@gmail.com>
>> CC: Weihang Li <liweihang@huawei.com>
>> CC: Mark Bloch <mbloch@nvidia.com>
>> CC: Wenpeng Liang <liangwenpeng@huawei.com>
>> CC: Aharon Landau <aharonl@nvidia.com>
>> CC: linux-rdma@vger.kernel.org
>> CC: linux-kernel@vger.kernel.org
>>
>> Li Zhijian (10):
>>    RDMA: mr: Introduce is_pmem
>>    RDMA: Allow registering MR with flush access flags
>>    RDMA/rxe: Allow registering FLUSH flags for supported device only
>>    RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
>>    RDMA/rxe: Allow registering persistent flag for pmem MR only
>>    RDMA/rxe: Implement RC RDMA FLUSH service in requester side
>>    RDMA/rxe: Set BTH's SE to zero for FLUSH packet
>>    RDMA/rxe: Implement flush execution in responder side
>>    RDMA/rxe: Implement flush completion
>>    RDMA/rxe: Add RD FLUSH service support
>>
>>   drivers/infiniband/core/uverbs_cmd.c    |  16 +++
>>   drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
>>   drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 ++++++++++
>>   drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
>>   drivers/infiniband/sw/rxe/rxe_mr.c      |  63 +++++++++++-
>>   drivers/infiniband/sw/rxe/rxe_opcode.c  |  33 ++++++
>>   drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
>>   drivers/infiniband/sw/rxe/rxe_param.h   |   3 +-
>>   drivers/infiniband/sw/rxe/rxe_req.c     |  14 ++-
>>   drivers/infiniband/sw/rxe/rxe_resp.c    | 131 +++++++++++++++++++++++-
>>   include/rdma/ib_pack.h                  |   3 +
>>   include/rdma/ib_verbs.h                 |  22 +++-
>>   include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
>>   include/uapi/rdma/ib_user_verbs.h       |  18 ++++
>>   include/uapi/rdma/rdma_user_rxe.h       |   6 ++
>>   15 files changed, 362 insertions(+), 10 deletions(-)
>>
>> --
>> 2.31.1
>>
>>
> 
> ---------------------------------------------------------------------
> Intel Technology Poland sp. z o.o.
> ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
> Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek przegladanie lub rozpowszechnianie jest zabronione.
> This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.
Zhijian Li (Fujitsu) Dec. 31, 2021, 1:10 a.m. UTC | #3
Tom, Tomasz

All you reviewing are greatly appreciated.


On 29/12/2021 22:35, Tom Talpey wrote:
> On 12/29/2021 3:49 AM, Gromadzki, Tomasz wrote:
>> 1)
>>> rdma_post_flush(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl,
>>>         int nsge, int flags, uint32_t type, uint32_t level,
>>>         uint64_t remote_addr, uint32_t rkey)
>> Shall we not use struct ibv_sge *sgl in this API call but explicitly use flush length (DMALen) argument?
>> It will be similar to other rdma_post_XXX API calls where ibv_sge is not used at all.
>> Ibv_sge is used only in rdma_post_XXXv API calls.
Indeed, i missed rdma_post_XXX and rdma_post_XXXv



>
> I agree. The Flush operation does not access any local memory and
> therefore an SGE is inappropriate. Better to pass the actual
> Flush parameters, i.e. add the length. See next comment.

It sounds good.


>
>> 2)
>>> struct ibv_send_wr {
>>> ...
>>>     union {
>>>         struct {
>>>             uint64_t    remote_addr;
>>>             uint32_t    rkey;
>>>             uint32_t    type;
>>>             uint32_t    level;
>>>         } flush;
>>
>> Shall we extend this structure with
>> uint32_t length
>
> Also agree. The remote length of the subregion to be flushed is
> a required parameter and should be passed here.

Agree.


>
> Is it really necessary to promote the type and level to full 32-bit
> fields in the API? In the protocol, these are just 4 bits and 2 bits,
> respectively, and are packed together into a singe 32-bit FETH field.
good catch, i will update both them to u8 instead.



>
> Tom.
>
>> and abandon using *sg_list as it is related in RDMA verbs to local memory access only:

oh, good to know that, it's new knowledge to me. very thanks.


>>
>> Ibv_post_send.3:
>> struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
>> int                     num_sge;                /* Size of the s/g array */
>>
>> In the case of flush, there is no local context at all.

Got it.


Thanks
Zhijian

>>
>> Thanks
>> Tomasz
>>
>>> -----Original Message-----
>>> From: Li Zhijian <lizhijian@cn.fujitsu.com>
>>> Sent: Tuesday, December 28, 2021 9:07 AM
>>> To: linux-rdma@vger.kernel.org; zyjzyj2000@gmail.com; jgg@ziepe.ca;
>>> aharonl@nvidia.com; leon@kernel.org
>>> Cc: linux-kernel@vger.kernel.org; mbloch@nvidia.com;
>>> liweihang@huawei.com; liangwenpeng@huawei.com;
>>> yangx.jy@cn.fujitsu.com; rpearsonhpe@gmail.com; y-goto@fujitsu.com; Li
>>> Zhijian <lizhijian@cn.fujitsu.com>
>>> Subject: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH
>>> operation
>>>
>>> Hey folks,
>>>
>>> These patches are going to implement a *NEW* RDMA opcode "RDMA
>>> FLUSH".
>>> In IB SPEC 1.5[1][2], 2 new opcodes, ATOMIC WRITE and RDMA FLUSH were
>>> added in the MEMORY PLACEMENT EXTENSIONS section.
>>>
>>> FLUSH is used by the requesting node to achieve guarantees on the data
>>> placement within the memory subsystem of preceding accesses to a single
>>> memory region, such as those performed by RDMA WRITE, Atomics and
>>> ATOMIC WRITE requests.
>>>
>>> The operation indicates the virtual address space of a destination node and
>>> where the guarantees should apply. This range must be contiguous in the
>>> virtual space of the memory key but it is not necessarily a contiguous range
>>> of physical memory.
>>>
>>> FLUSH packets carry FLUSH extended transport header (see below) to
>>> specify the placement type and the selectivity level of the operation and
>>> RDMA extended header (RETH, see base document RETH definition) to
>>> specify the R_Key VA and Length associated with this request following the
>>> BTH in RC, RDETH in RD and XRCETH in XRC.
>>>
>>> RC FLUSH:
>>> +----+------+------+
>>> |BTH | FETH | RETH |
>>> +----+------+------+
>>>
>>> RD FLUSH:
>>> +----+------+------+------+
>>> |BTH | RDETH| FETH | RETH |
>>> +----+------+------+------+
>>>
>>> XRC FLUSH:
>>> +----+-------+------+------+
>>> |BTH | XRCETH| FETH | RETH |
>>> +----+-------+------+------+
>>>
>>> Currently, we introduce RC and RD services only, since XRC has not been
>>> implemented by rxe yet.
>>> NOTE: only RC service is tested now, and since other HCAs have not
>>> added/implemented FLUSH yet, we can only test FLUSH operation in both
>>> SoftRoCE/rxe devices.
>>>
>>> The corresponding rdma-core and FLUSH example are available on:
>>> https://github.com/zhijianli88/rdma-core/tree/rfc
>>>
>>> Below list some details about FLUSH transport packet:
>>>
>>> A FLUSH message is built upon FLUSH request packet and is responded
>>> successfully by RDMA READ response of zero size.
>>>
>>> oA19-2: FLUSH shall be single packet message and shall have no payload.
>>>
>>> oA19-2: FLUSH shall be single packet message and shall have no payload.
>>> oA19-5: FLUSH BTH shall hold the Opcode = 0x1C
>>>
>>> FLUSH Extended Transport Header(FETH)
>>> +-----+-----------+------------------------+----------------------+
>>> |Bits |   31-6    |          5-4           | 3-0           |
>>> +-----+-----------+------------------------+----------------------+
>>> |     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
>>> +-----+-----------+------------------------+----------------------+
>>>
>>> Selectivity Level (SEL) – defines the memory region scope the FLUSH should
>>> apply on. Values are as follows:
>>> • b’00 - Memory Region Range: FLUSH applies for all preceding memory
>>>           updates to the RETH range on this QP. All RETH fields shall be
>>>           valid in this selectivity mode. RETH:DMALen field shall be
>>>           between zero and (2 31 -1) bytes (inclusive).
>>> • b’01 - Memory Region: FLUSH applies for all preceding memory up-
>>>           dates to RETH.R_key on this QP. RETH:DMALen and RETH:VA
>>>           shall be ignored in this mode.
>>> • b'10 - Reserved.
>>> • b'11 - Reserved.
>>>
>>> Placement Type (PLT) – Defines the memory placement guarantee of this
>>> FLUSH. Multiple bits may be set in this field. Values are as follows:
>>> • Bit 0 if set to '1' indicated that the FLUSH should guarantee Global
>>>    Visibility.
>>> • Bit 1 if set to '1' indicated that the FLUSH should guarantee
>>>    Persistence.
>>> • Bits 3:2 are reserved
>>>
>>> [1]: https://www.infinibandta.org/ibta-specification/ # login required
>>> [2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-
>>> Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx
>>>
>>> CC: Jason Gunthorpe <jgg@ziepe.ca>
>>> CC: Zhu Yanjun <zyjzyj2000@gmail.com
>>> CC: Leon Romanovsky <leon@kernel.org>
>>> CC: Bob Pearson <rpearsonhpe@gmail.com>
>>> CC: Weihang Li <liweihang@huawei.com>
>>> CC: Mark Bloch <mbloch@nvidia.com>
>>> CC: Wenpeng Liang <liangwenpeng@huawei.com>
>>> CC: Aharon Landau <aharonl@nvidia.com>
>>> CC: linux-rdma@vger.kernel.org
>>> CC: linux-kernel@vger.kernel.org
>>>
>>> Li Zhijian (10):
>>>    RDMA: mr: Introduce is_pmem
>>>    RDMA: Allow registering MR with flush access flags
>>>    RDMA/rxe: Allow registering FLUSH flags for supported device only
>>>    RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
>>>    RDMA/rxe: Allow registering persistent flag for pmem MR only
>>>    RDMA/rxe: Implement RC RDMA FLUSH service in requester side
>>>    RDMA/rxe: Set BTH's SE to zero for FLUSH packet
>>>    RDMA/rxe: Implement flush execution in responder side
>>>    RDMA/rxe: Implement flush completion
>>>    RDMA/rxe: Add RD FLUSH service support
>>>
>>>   drivers/infiniband/core/uverbs_cmd.c    |  16 +++
>>>   drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
>>>   drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 ++++++++++
>>>   drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
>>>   drivers/infiniband/sw/rxe/rxe_mr.c      |  63 +++++++++++-
>>>   drivers/infiniband/sw/rxe/rxe_opcode.c  |  33 ++++++
>>>   drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
>>>   drivers/infiniband/sw/rxe/rxe_param.h   |   3 +-
>>>   drivers/infiniband/sw/rxe/rxe_req.c     |  14 ++-
>>>   drivers/infiniband/sw/rxe/rxe_resp.c    | 131 +++++++++++++++++++++++-
>>>   include/rdma/ib_pack.h                  |   3 +
>>>   include/rdma/ib_verbs.h                 |  22 +++-
>>>   include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
>>>   include/uapi/rdma/ib_user_verbs.h       |  18 ++++
>>>   include/uapi/rdma/rdma_user_rxe.h       |   6 ++
>>>   15 files changed, 362 insertions(+), 10 deletions(-)
>>>
>>> -- 
>>> 2.31.1
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> Intel Technology Poland sp. z o.o.
>> ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
>> Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek przegladanie lub rozpowszechnianie jest zabronione.
>> This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.