
[RFC,v2,2/2] RDMA/rxe: Support RDMA Atomic Write operation

Message ID 20220113030350.2492841-3-yangx.jy@fujitsu.com (mailing list archive)
State Superseded
Series RDMA/rxe: Add RDMA Atomic Write operation

Commit Message

Xiao Yang Jan. 13, 2022, 3:03 a.m. UTC
This patch implements the RDMA Atomic Write operation for the RC service.

Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c   |  4 +++
 drivers/infiniband/sw/rxe/rxe_opcode.c | 18 +++++++++++
 drivers/infiniband/sw/rxe/rxe_opcode.h |  3 ++
 drivers/infiniband/sw/rxe/rxe_qp.c     |  3 +-
 drivers/infiniband/sw/rxe/rxe_req.c    | 11 +++++--
 drivers/infiniband/sw/rxe/rxe_resp.c   | 43 +++++++++++++++++++++-----
 include/rdma/ib_pack.h                 |  2 ++
 include/rdma/ib_verbs.h                |  2 ++
 include/uapi/rdma/ib_user_verbs.h      |  2 ++
 include/uapi/rdma/rdma_user_rxe.h      |  1 +
 10 files changed, 78 insertions(+), 11 deletions(-)

Comments

Jason Gunthorpe Jan. 17, 2022, 1:16 p.m. UTC | #1
On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
> +					     struct rxe_pkt_info *pkt)
> +{
> +	struct rxe_mr *mr = qp->resp.mr;
> +
> +	u64 *src = payload_addr(pkt);
> +
> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
> +	if (!dst || (uintptr_t)dst & 7)
> +		return RESPST_ERR_MISALIGNED_ATOMIC;

It looks to me like iova_to_vaddr is completely broken, where is the
kmap on that flow?

Jason
Xiao Yang Jan. 18, 2022, 8:01 a.m. UTC | #2
On 2022/1/17 21:16, Jason Gunthorpe wrote:
> On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
>> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
>> +					     struct rxe_pkt_info *pkt)
>> +{
>> +	struct rxe_mr *mr = qp->resp.mr;
>> +
>> +	u64 *src = payload_addr(pkt);
>> +
>> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
>> +	if (!dst || (uintptr_t)dst & 7)
>> +		return RESPST_ERR_MISALIGNED_ATOMIC;
> It looks to me like iova_to_vaddr is completely broken, where is the
> kmap on that flow?
Hi Jason,

I think rxe_mr_init_user() maps the user addr space to the kernel addr 
space during memory region registration, the mapping records are saved 
into mr->cur_map_set->map[x].
Why do you think iova_to_vaddr() is completely broken?

Best Regards,
Xiao Yang
> Jason
Xiao Yang Jan. 18, 2022, 9:03 a.m. UTC | #5
Hi,

Sorry for the duplicate replies.

Best Regards,
Xiao Yang
On 2022/1/18 16:04, yangx.jy@fujitsu.com wrote:
> On 2022/1/17 21:16, Jason Gunthorpe wrote:
>> On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
>>> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
>>> +					     struct rxe_pkt_info *pkt)
>>> +{
>>> +	struct rxe_mr *mr = qp->resp.mr;
>>> +
>>> +	u64 *src = payload_addr(pkt);
>>> +
>>> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
>>> +	if (!dst || (uintptr_t)dst & 7)
>>> +		return RESPST_ERR_MISALIGNED_ATOMIC;
>> It looks to me like iova_to_vaddr is completely broken, where is the
>> kmap on that flow?
> Hi Jason,
>
> I think rxe_mr_init_user() maps the user addr space to the kernel addr 
> space during memory region registration, the mapping records are saved 
> into mr->cur_map_set->map[x].
> Why do you think iova_to_vaddr() is completely broken?
>
> Best Regards,
> Xiao Yang
>> Jason
Jason Gunthorpe Jan. 18, 2022, 12:35 p.m. UTC | #6
On Tue, Jan 18, 2022 at 08:01:59AM +0000, yangx.jy@fujitsu.com wrote:
> On 2022/1/17 21:16, Jason Gunthorpe wrote:
> > On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
> >> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
> >> +					     struct rxe_pkt_info *pkt)
> >> +{
> >> +	struct rxe_mr *mr = qp->resp.mr;
> >> +
> >> +	u64 *src = payload_addr(pkt);
> >> +
> >> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
> >> +	if (!dst || (uintptr_t)dst & 7)
> >> +		return RESPST_ERR_MISALIGNED_ATOMIC;
> > It looks to me like iova_to_vaddr is completely broken, where is the
> > kmap on that flow?
> Hi Jason,
> 
> I think rxe_mr_init_user() maps the user addr space to the kernel addr 
> space during memory region registration, the mapping records are saved 
> into mr->cur_map_set->map[x].

There is no way to touch user memory from the CPU in the kernel
without calling one of the kmap's, so I don't know what this thinks it
is doing.

Jason
Zhijian Li (Fujitsu) Jan. 19, 2022, 1:54 a.m. UTC | #7
On 18/01/2022 20:35, Jason Gunthorpe wrote:
> On Tue, Jan 18, 2022 at 08:01:59AM +0000, yangx.jy@fujitsu.com wrote:
>> On 2022/1/17 21:16, Jason Gunthorpe wrote:
>>> On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
>>>> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
>>>> +					     struct rxe_pkt_info *pkt)
>>>> +{
>>>> +	struct rxe_mr *mr = qp->resp.mr;
>>>> +
>>>> +	u64 *src = payload_addr(pkt);
>>>> +
>>>> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
>>>> +	if (!dst || (uintptr_t)dst & 7)
>>>> +		return RESPST_ERR_MISALIGNED_ATOMIC;
>>> It looks to me like iova_to_vaddr is completely broken, where is the
>>> kmap on that flow?
>> Hi Jason,
>>
>> I think rxe_mr_init_user() maps the user addr space to the kernel addr
>> space during memory region registration, the mapping records are saved
>> into mr->cur_map_set->map[x].
> There is no way to touch user memory from the CPU in the kernel
That's absolutely right, but I don't think it references that user memory directly.

> without calling one of the kmap's, so I don't know what this thinks it
> is doing.
>
> Jason

IMHO, for rxe, rxe_mr_init_user() will call get_user_pages() to pin the iova first, and then
the page address will be recorded into mr->cur_map_set->map[x]. So when we want
to reference the iova's kernel address, we can call iova_to_vaddr(), which retrieves the kernel
address by traversing mr->cur_map_set->map[x].

Currently RDMA WRITE, RDMA ATOMIC, etc. use the same scheme to reference the iova.
Feel free to correct me if I missed something :)

Do you mean we should retrieve the iova's page first, and then reference the kernel address via
kmap()? Sorry for my stupid question.

Thanks
Zhijian
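
For readers following the thread, a minimal sketch of the map-set walk Zhijian
describes; the entry layout and helper names are illustrative stand-ins, not
the actual rxe source:

struct sketch_map_entry {
	void	*vaddr;		/* kernel address recorded at registration */
	size_t	length;		/* bytes covered by this entry */
};

/* Translate an iova inside the MR to the kernel address recorded when
 * the pages were pinned.  Accesses that would straddle two entries are
 * not handled; they return NULL like any out-of-range iova.
 */
static void *sketch_iova_to_vaddr(struct sketch_map_entry *map, int nents,
				  u64 iova, u64 mr_iova, size_t len)
{
	u64 offset = iova - mr_iova;	/* offset from the start of the MR */
	int i;

	for (i = 0; i < nents; i++) {
		if (offset + len <= map[i].length)
			return (char *)map[i].vaddr + offset;
		if (offset < map[i].length)
			return NULL;	/* would cross an entry boundary */
		offset -= map[i].length;
	}
	return NULL;
}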
Jason Gunthorpe Jan. 19, 2022, 12:36 p.m. UTC | #8
On Wed, Jan 19, 2022 at 01:54:32AM +0000, lizhijian@fujitsu.com wrote:
> 
> 
> On 18/01/2022 20:35, Jason Gunthorpe wrote:
> > On Tue, Jan 18, 2022 at 08:01:59AM +0000, yangx.jy@fujitsu.com wrote:
> >> On 2022/1/17 21:16, Jason Gunthorpe wrote:
> >>> On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
> >>>> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
> >>>> +					     struct rxe_pkt_info *pkt)
> >>>> +{
> >>>> +	struct rxe_mr *mr = qp->resp.mr;
> >>>> +
> >>>> +	u64 *src = payload_addr(pkt);
> >>>> +
> >>>> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
> >>>> +	if (!dst || (uintptr_t)dst & 7)
> >>>> +		return RESPST_ERR_MISALIGNED_ATOMIC;
> >>> It looks to me like iova_to_vaddr is completely broken, where is the
> >>> kmap on that flow?
> >> Hi Jason,
> >>
> >> I think rxe_mr_init_user() maps the user addr space to the kernel addr
> >> space during memory region registration, the mapping records are saved
> >> into mr->cur_map_set->map[x].
> > There is no way to touch user memory from the CPU in the kernel
> That's absolutely right, but I don't think it references that user memory directly.
> 
> > without calling one of the kmap's, so I don't know what this thinks it
> > is doing.
> >
> > Jason
> 
> IMHO, for rxe, rxe_mr_init_user() will call get_user_pages() to pin the iova first, and then
> the page address will be recorded into mr->cur_map_set->map[x]. So when we want
> to reference the iova's kernel address, we can call iova_to_vaddr(), which retrieves the kernel
> address by traversing mr->cur_map_set->map[x].

That flow needs a kmap

> Do you mean we should retrieve the iova's page first, and then reference the kernel address via
> kmap()? Sorry for my stupid question.

Going from struct page to something the kernel can touch requires
kmap.

Jason
Ira Weiny Jan. 19, 2022, 4:47 p.m. UTC | #9
On Wed, Jan 19, 2022 at 08:36:35AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 19, 2022 at 01:54:32AM +0000, lizhijian@fujitsu.com wrote:

[snip]

> > 
> > IMHO, for rxe, rxe_mr_init_user() will call get_user_pages() to pin the iova first, and then
> > the page address will be recorded into mr->cur_map_set->map[x]. So when we want
> > to reference the iova's kernel address, we can call iova_to_vaddr(), which retrieves the kernel
> > address by traversing mr->cur_map_set->map[x].
> 
> That flow needs a kmap
> 
> > Do you mean we should retrieve the iova's page first, and then reference the kernel address via
> > kmap()? Sorry for my stupid question.
> 
> Going from struct page to something the kernel can touch requires
> kmap.

The kmap() call is being deprecated.  Please use kmap_local_page() if possible.

https://lore.kernel.org/lkml/20211210232404.4098157-1-ira.weiny@intel.com/

Thanks,
Ira
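
Putting the two replies together, the pattern being asked for is a mapping
scoped tightly around the access.  kmap_local_page()/kunmap_local() are the
real kernel APIs; the helper wrapped around them here is a hypothetical
sketch:

#include <linux/highmem.h>
#include <linux/string.h>

/* Hypothetical sketch: map the page only for the duration of the copy.
 * kmap_local_page() is valid only in the current thread of execution,
 * so the returned address must not be stashed in the MR for later use.
 */
static void sketch_copy_to_page(struct page *page, unsigned int offset,
				const void *src, size_t len)
{
	void *vaddr = kmap_local_page(page);	/* map around the usage */

	memcpy(vaddr + offset, src, len);
	kunmap_local(vaddr);			/* and drop it right away */
}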
Li Zhijian Jan. 20, 2022, 12:07 p.m. UTC | #10
Hi Jason


on 2022/1/19 20:36, Jason Gunthorpe wrote:
> On Wed, Jan 19, 2022 at 01:54:32AM +0000, lizhijian@fujitsu.com wrote:
>>
>> On 18/01/2022 20:35, Jason Gunthorpe wrote:
>>> On Tue, Jan 18, 2022 at 08:01:59AM +0000, yangx.jy@fujitsu.com wrote:
>>>> On 2022/1/17 21:16, Jason Gunthorpe wrote:
>>>>> On Thu, Jan 13, 2022 at 11:03:50AM +0800, Xiao Yang wrote:
>>>>>> +static enum resp_states process_atomic_write(struct rxe_qp *qp,
>>>>>> +					     struct rxe_pkt_info *pkt)
>>>>>> +{
>>>>>> +	struct rxe_mr *mr = qp->resp.mr;
>>>>>> +
>>>>>> +	u64 *src = payload_addr(pkt);
>>>>>> +
>>>>>> +	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
>>>>>> +	if (!dst || (uintptr_t)dst & 7)
>>>>>> +		return RESPST_ERR_MISALIGNED_ATOMIC;
>>>>> It looks to me like iova_to_vaddr is completely broken, where is the
>>>>> kmap on that flow?
>>>> Hi Jason,
>>>>
>>>> I think rxe_mr_init_user() maps the user addr space to the kernel addr
>>>> space during memory region registration, the mapping records are saved
>>>> into mr->cur_map_set->map[x].
>>> There is no way to touch user memory from the CPU in the kernel
>> That's absolutely right, but I don't think it references that user memory directly.
>>
>>> without calling one of the kmap's, so I don't know what this thinks it
>>> is doing.
>>>
>>> Jason
>> IMHO, for rxe, rxe_mr_init_user() will call get_user_pages() to pin the iova first, and then
>> the page address will be recorded into mr->cur_map_set->map[x]. So when we want
>> to reference the iova's kernel address, we can call iova_to_vaddr(), which retrieves the kernel
>> address by traversing mr->cur_map_set->map[x].
> That flow needs a kmap

IIUC, this was an existing issue in iova_to_vaddr(), right?
Alternatively, can we make the changes below to fix it simply?

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 0621d387ccba..978fdd23665c 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -260,7 +260,8 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
                                 num_buf = 0;
                         }

-                       vaddr = page_address(sg_page_iter_page(&sg_iter));
+                       // FIXME: don't forget to kunmap_local(vaddr)
+                       vaddr = kmap_local_page(sg_page_iter_page(&sg_iter));
                        if (!vaddr) {
                                pr_warn("%s: Unable to get virtual address\n",
                                                __func__);



>
>> Do you mean we should retrieve the iova's page first, and then reference the kernel address via
>> kmap()? Sorry for my stupid question.
> Going from struct page to something the kernel can touch requires
> kmap.

Got it

Thanks

Zhijian


>
> Jason
Jason Gunthorpe Jan. 21, 2022, 12:58 p.m. UTC | #11
On Thu, Jan 20, 2022 at 08:07:36PM +0800, Li, Zhijian wrote:

> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 0621d387ccba..978fdd23665c 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -260,7 +260,8 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>                                 num_buf = 0;
>                         }
> 
> -                       vaddr = page_address(sg_page_iter_page(&sg_iter));
> +                       // FIXME: don't forget to kunmap_local(vaddr)
> +                       vaddr = kmap_local_page(sg_page_iter_page(&sg_iter));

No, you can't leave the kmap open for a long time. The kmap has to
just be around the usage.

Jason
Ira Weiny Jan. 21, 2022, 4:06 p.m. UTC | #12
On Fri, Jan 21, 2022 at 08:58:37AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 20, 2022 at 08:07:36PM +0800, Li, Zhijian wrote:
> 
> > diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> > index 0621d387ccba..978fdd23665c 100644
> > +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> > @@ -260,7 +260,8 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
> >                                 num_buf = 0;
> >                         }
> > 
> > -                       vaddr = page_address(sg_page_iter_page(&sg_iter));
> > +                       // FIXME: don't forget to kunmap_local(vaddr)
> > +                       vaddr = kmap_local_page(sg_page_iter_page(&sg_iter));
> 
> No, you can't leave the kmap open for a long time. The kmap has to
> just be around the usage.

Indeed Jason is correct here.  A couple of details here.  First
kmap_local_page() is only valid within the current thread of execution.  So
what you propose above will not work at all.  Second, kmap() is to be avoided.

Finally, that page_address() should be avoided IMO and will be broken, at least
for persistent memory pages, once some of my work lands.[*]  Jason would know
better, but I think page_address should be avoided in all driver code.  But
there is no clear documentation on that.

Taking a quick look at rxe_mr.c buf->addr is only used in rxe_mr_init_user().
You need to kmap_local_page() around that access.  What else is struct
rxe_phys_buf->addr used for?  Can you just map the page when you need it in
rxe_mr_init_user()?

If you must create a mapping that is permanent you could look at vmap().

Ira

> 
> Jason
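
If a mapping really must outlive a single access, the vmap() route Ira
mentions would look roughly like this; vmap()/vunmap() are the real APIs,
while the wrappers are a hypothetical sketch:

#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Hypothetical sketch: build one contiguous, long-lived kernel mapping
 * over the pinned pages of an MR at registration time.  Unlike a
 * kmap_local_page() mapping, this stays valid until vunmap(), so it can
 * be recorded in the MR and used from any context later.
 */
static void *sketch_vmap_mr(struct page **pages, unsigned int npages)
{
	return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
}

static void sketch_vunmap_mr(void *vaddr)
{
	vunmap(vaddr);	/* pair with the vmap() above at deregistration */
}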
Ira Weiny Jan. 21, 2022, 4:08 p.m. UTC | #13
On Fri, Jan 21, 2022 at 08:06:54AM -0800, 'Ira Weiny' wrote:
> On Fri, Jan 21, 2022 at 08:58:37AM -0400, Jason Gunthorpe wrote:
> > On Thu, Jan 20, 2022 at 08:07:36PM +0800, Li, Zhijian wrote:
> > 
> > > diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> > > index 0621d387ccba..978fdd23665c 100644
> > > +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> > > @@ -260,7 +260,8 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
> > >                                 num_buf = 0;
> > >                         }
> > > 
> > > -                       vaddr = page_address(sg_page_iter_page(&sg_iter));
> > > +                       // FIXME: don't forget to kunmap_local(vaddr)
> > > +                       vaddr = kmap_local_page(sg_page_iter_page(&sg_iter));
> > 
> > No, you can't leave the kmap open for a long time. The kmap has to
> > just be around the usage.
> 
> Indeed Jason is correct here.  A couple of details here.  First
> kmap_local_page() is only valid within the current thread of execution.  So
> what you propose above will not work at all.  Second, kmap() is to be avoided.
> 
> Finally, that page_address() should be avoided IMO and will be broken, at least
> for persistent memory pages, once some of my work lands.[*]  Jason would know
> better, but I think page_address should be avoided in all driver code.  But
> there is no clear documentation on that.
> 
> Taking a quick look at rxe_mr.c buf->addr is only used in rxe_mr_init_user().
                                                            ^^^^^^^^^^^^^^
Sorry...                                            I meant rxe_mr_copy()...

> You need to kmap_local_page() around that access.  What else is struct
> rxe_phys_buf->addr used for?  Can you just map the page when you need it in
> rxe_mr_init_user()?

rxe_mr_copy()...

Ira

> 
> If you must create a mapping that is permanent you could look at vmap().
> 
> Ira
> 
> > 
> > Jason
Zhijian Li (Fujitsu) Jan. 24, 2022, 3:47 a.m. UTC | #14
Jason & Ira


On 22/01/2022 00:08, Ira Weiny wrote:
> On Fri, Jan 21, 2022 at 08:06:54AM -0800, 'Ira Weiny' wrote:
>> On Fri, Jan 21, 2022 at 08:58:37AM -0400, Jason Gunthorpe wrote:
>>> On Thu, Jan 20, 2022 at 08:07:36PM +0800, Li, Zhijian wrote:
>>>
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>>>> index 0621d387ccba..978fdd23665c 100644
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>>>> @@ -260,7 +260,8 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>>>>                                  num_buf = 0;
>>>>                          }
>>>>
>>>> -                       vaddr = page_address(sg_page_iter_page(&sg_iter));
>>>> +                       // FIXME: don't forget to kunmap_local(vaddr)
>>>> +                       vaddr = kmap_local_page(sg_page_iter_page(&sg_iter));
>>> No, you can't leave the kmap open for a long time. The kmap has to
>>> just be around the usage.
>> Indeed Jason is correct here.  A couple of details here.  First
>> kmap_local_page() is only valid within the current thread of execution.  So
>> what you propose above will not work at all.

Thank you all for the detailed explanation. It *does* make sense.



>>   Second, kmap() is to be avoided.
>>
>> Finally, that page_address() should be avoided IMO and will be broken, at least
>> for persistent memory pages, once some of my work lands.[*]  Jason would know
>> better, but I think page_address should be avoided in all driver code.  But
>> there is no clear documentation on that.
>>
>> Taking a quick look at rxe_mr.c buf->addr is only used in rxe_mr_init_user().
>                                                              ^^^^^^^^^^^^^^
> Sorry...                                            I meant rxe_mr_copy()...

iova_to_vaddr() is also used by process_atomic().



>
>> You need to kmap_local_page() around that access.  What else is struct
>> rxe_phys_buf->addr used for?

IIUC, rxe_phys_buf->addr is used to permanently map a user space address to a kernel space address,
so in the RDMA scenario the REMOTE side can directly access (read/write) LOCAL memory allocated by a user space application.


>> Can you just map the page when you need it in
>> rxe_mr_init_user()?

It can be, in theory, but I'm not sure. @Jason, what's your opinion?


> rxe_mr_copy()...
>
> Ira
>
>> If you must create a mapping that is permanent you could look at vmap().

Well, let me see


Thanks
Zhijian

>>
>> Ira
>>
>>> Jason
Xiao Yang Jan. 27, 2022, 9:37 a.m. UTC | #15
On 2022/1/22 0:06, Ira Weiny wrote:
> On Fri, Jan 21, 2022 at 08:58:37AM -0400, Jason Gunthorpe wrote:
>> On Thu, Jan 20, 2022 at 08:07:36PM +0800, Li, Zhijian wrote:
>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>>> index 0621d387ccba..978fdd23665c 100644
>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>>> @@ -260,7 +260,8 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>>>                                  num_buf = 0;
>>>                          }
>>>
>>> -                       vaddr = page_address(sg_page_iter_page(&sg_iter));
>>> +                       // FIXME: don't forget to kunmap_local(vaddr)
>>> +                       vaddr = kmap_local_page(sg_page_iter_page(&sg_iter));
>> No, you can't leave the kmap open for a long time. The kmap has to
>> just be around the usage.

Cc Christoph Hellwig

Hi Jason, Ira

Sorry for the late reply.

Do you mean we have to consider that some allocated pages come from high 
memory?

I think INFINIBAND_VIRT_DMA kconfig[1] has ensured that all allocated 
pages have a kernel virtual address.

In this case, is it OK to call page_address() directly?

[1]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b1e678bf290db5a76f1b6a9f7c381310e03440d6


> Indeed Jason is correct here.  A couple of details here.  First
> kmap_local_page() is only valid within the current thread of execution.  So
> what you propose above will not work at all.  Second, kmap() is to be avoided.
>
> Finally, that page_address() should be avoided IMO and will be broken, at least
> for persistent memory pages, once some of my work lands.[*]  Jason would know
> better, but I think page_address should be avoided in all driver code.  But
> there is no clear documentation on that.

Could you explain why page_address() will be broken for persistent 
memory pages?

Sorry, I am not familiar with the principles of persistent memory.

Best Regards,

Xiao Yang

>
> Taking a quick look at rxe_mr.c buf->addr is only used in rxe_mr_init_user().
> You need to kmap_local_page() around that access.  What else is struct
> rxe_phys_buf->addr used for?  Can you just map the page when you need it in
> rxe_mr_init_user()?
>
> If you must create a mapping that is permanent you could look at vmap().
>
> Ira
>
>> Jason
Christoph Hellwig Jan. 27, 2022, 9:57 a.m. UTC | #16
On Thu, Jan 27, 2022 at 09:37:59AM +0000, yangx.jy@fujitsu.com wrote:
> Do you mean we have to consider that some allocated pages come from high 
> memory?
>
> I think INFINIBAND_VIRT_DMA kconfig[1] has ensured that all allocated 
> pages have a kernel virtual address.

rxe and siw depend on INFINIBAND_VIRT_DMA which depends on !HIGHMEM,
so you don't need kmap here at all.

> In this case, is it OK to call page_address() directly?

Yes.
Ira Weiny Jan. 27, 2022, 6:08 p.m. UTC | #17
On Thu, Jan 27, 2022 at 10:57:30AM +0100, Christoph Hellwig wrote:
> On Thu, Jan 27, 2022 at 09:37:59AM +0000, yangx.jy@fujitsu.com wrote:
> > Do you mean we have to consider that some allocated pages come from high 
> > memory?
> >
> > I think INFINIBAND_VIRT_DMA kconfig[1] has ensured that all allocated 
> > pages have a kernel virtual address.
> 
> rxe and siw depend on INFINIBAND_VIRT_DMA which depends on !HIGHMEM,
> so you don't need kmap here at all.

Until/if I get PKS protecting pmem.[1]  Then if the page is pmem, page_address()
will give you an address which you will fault on when you access it.

> 
> > In this case, is it OK to call page_address() directly?
> 
> Yes.

For now yes...  But please use kmap_local_page() and it will do the right thing
(by default it calls page_address() on !HIGHMEM systems).

IMO page_address() is a holdover from a time when memory management was much
simpler and, again IMO, today its use assumes too much for drivers like this.
As more protections on memory are implemented, it presents a series of land
mines to be found.

While kmap is also a holdover, I'm trying to redefine it to be 'get the kernel
mapping' rather than 'map this into highmem'...  But perhaps I'm losing that
battle...

Ira

[1] https://lore.kernel.org/lkml/20220127175505.851391-1-ira.weiny@intel.com/T/#mcd60ea9a9c7b90e63b8d333c9270186fc7e47707
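
The parenthetical above is the crux: on !HIGHMEM configurations the local
kmap wrappers collapse into a direct-map lookup.  Roughly how the wrapper
is shaped on the two configurations — a simplification of the kernel's
highmem wrappers, not a verbatim copy:

#include <linux/highmem.h>

/* Simplified shape of kmap_local_page(): only HIGHMEM kernels build a
 * temporary CPU-local mapping; everywhere else the page already has a
 * direct-map address and page_address() is all that happens.
 */
static inline void *sketch_kmap_local_page(struct page *page)
{
#ifdef CONFIG_HIGHMEM
	return __kmap_local_page_prot(page, kmap_prot);	/* temporary map */
#else
	return page_address(page);	/* direct map; nothing to undo */
#endif
}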
Christoph Hellwig Jan. 28, 2022, 6:16 a.m. UTC | #18
On Thu, Jan 27, 2022 at 10:08:33AM -0800, Ira Weiny wrote:
> On Thu, Jan 27, 2022 at 10:57:30AM +0100, Christoph Hellwig wrote:
> > On Thu, Jan 27, 2022 at 09:37:59AM +0000, yangx.jy@fujitsu.com wrote:
> > > Do you mean we have to consider that some allocated pages come from high 
> > > memory?
> > >
> > > I think INFINIBAND_VIRT_DMA kconfig[1] has ensured that all allocated 
> > > pages have a kernel virtual address.
> > 
> > rxe and siw depend on INFINIBAND_VIRT_DMA which depends on !HIGHMEM,
> > so you don't need kmap here at all.
> 
> Until/if I get PKS protecting pmem.[1]  Then if the page is pmem, page_address()
> will give you an address which you will fault on when you access it.

In that case we'll have problems all over the drivers that select
INFINIBAND_VIRT_DMA, so this one doesn't make much of a difference..
Ira Weiny Jan. 28, 2022, 7:15 p.m. UTC | #19
On Fri, Jan 28, 2022 at 07:16:18AM +0100, Christoph Hellwig wrote:
> On Thu, Jan 27, 2022 at 10:08:33AM -0800, Ira Weiny wrote:
> > On Thu, Jan 27, 2022 at 10:57:30AM +0100, Christoph Hellwig wrote:
> > > On Thu, Jan 27, 2022 at 09:37:59AM +0000, yangx.jy@fujitsu.com wrote:
> > > > Do you mean we have to consider that some allocated pages come from high 
> > > > memory?
> > > >
> > > > I think INFINIBAND_VIRT_DMA kconfig[1] has ensured that all allocated 
> > > > pages have a kernel virtual address.
> > > 
> > > rxe and siw depend on INFINIBAND_VIRT_DMA which depends on !HIGHMEM,
> > > so you don't need kmap here at all.
> > 
> > Until/if I get PKS protecting pmem.[1]  Then if the page is pmem, page_address()
> > will give you an address which you will fault on when you access it.
> 
> In that case we'll have problems all over the drivers that select
> INFINIBAND_VIRT_DMA, so this one doesn't make much of a difference..

Ah...  ok...

Obviously I've not kept up with the software providers...

For this use case, can kmap_local_page() work?

Ira
Xiao Yang Feb. 10, 2022, 11:06 a.m. UTC | #20
On 2022/1/29 3:15, Ira Weiny wrote:

> Ah...  ok...
>
> Obviously I've not kept up with the software providers...
>
> For this use case, can kmap_local_page() work?
>
> Ira

Hi Ira,

I applied your patch set on my branch but ibv_reg_mr() with /dev/dax0.0 
didn't trigger any warning or panic after enabling DEVMAP_ACCESS_PROTECTION.

PS: ibv_reg_mr() calls page_address() in kernel.

Do you know if the hardware (e.g. the CPU) needs to support the PKS feature? How
do I check for the feature on the hardware? Or do I need more steps?

Best Regards,

Xiao Yang
Ira Weiny Feb. 11, 2022, 6:30 p.m. UTC | #21
On Thu, Feb 10, 2022 at 11:06:55AM +0000, yangx.jy@fujitsu.com wrote:
> On 2022/1/29 3:15, Ira Weiny wrote:
> 
> > Ah...  ok...
> >
> > Obviously I've not kept up with the software providers...
> >
> > For this use case, can kmap_local_page() work?
> >
> > Ira
> 
> Hi Ira,
> 
> I applied your patch set on my branch but ibv_reg_mr() with /dev/dax0.0 
> didn't trigger any warning or panic after enabling DEVMAP_ACCESS_PROTECTION.
> 
> PS: ibv_reg_mr() calls page_address() in kernel.
> 
> Do you know if the hardware (e.g. the CPU) needs to support the PKS feature?

Yes, you will need Sapphire Rapids hardware or newer, or QEMU 6.0, which also
supports PKS.

>
> How do I check for the feature on the hardware? Or do I need more steps?

My lscpu on QEMU reports pks.  I suspect you don't have PKS on the hardware,
so the access still works.

Ira

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index d771ba8449a1..5a5c0c3501de 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -104,6 +104,7 @@  static enum ib_wc_opcode wr_to_wc_opcode(enum ib_wr_opcode opcode)
 	case IB_WR_LOCAL_INV:			return IB_WC_LOCAL_INV;
 	case IB_WR_REG_MR:			return IB_WC_REG_MR;
 	case IB_WR_BIND_MW:			return IB_WC_BIND_MW;
+	case IB_WR_RDMA_ATOMIC_WRITE:		return IB_WC_RDMA_ATOMIC_WRITE;
 
 	default:
 		return 0xff;
@@ -256,6 +257,9 @@  static inline enum comp_state check_ack(struct rxe_qp *qp,
 		if ((syn & AETH_TYPE_MASK) != AETH_ACK)
 			return COMPST_ERROR;
 
+		if (wqe->wr.opcode == IB_WR_RDMA_ATOMIC_WRITE)
+			return COMPST_WRITE_SEND;
+
 		fallthrough;
 		/* (IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE doesn't have an AETH)
 		 */
diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.c b/drivers/infiniband/sw/rxe/rxe_opcode.c
index 3ef5a10a6efd..ec0f795adb5c 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.c
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.c
@@ -103,6 +103,12 @@  struct rxe_wr_opcode_info rxe_wr_opcode_info[] = {
 			[IB_QPT_UC]	= WR_LOCAL_OP_MASK,
 		},
 	},
+	[IB_WR_RDMA_ATOMIC_WRITE]                       = {
+		.name   = "IB_WR_RDMA_ATOMIC_WRITE",
+		.mask   = {
+			[IB_QPT_RC]     = WR_ATOMIC_WRITE_MASK,
+		},
+	},
 };
 
 struct rxe_opcode_info rxe_opcode[RXE_NUM_OPCODE] = {
@@ -379,6 +385,18 @@  struct rxe_opcode_info rxe_opcode[RXE_NUM_OPCODE] = {
 						+ RXE_IETH_BYTES,
 		}
 	},
+	[IB_OPCODE_RC_RDMA_ATOMIC_WRITE]                        = {
+		.name   = "IB_OPCODE_RC_RDMA_ATOMIC_WRITE",
+		.mask   = RXE_RETH_MASK | RXE_PAYLOAD_MASK | RXE_REQ_MASK
+				| RXE_ATOMIC_WRITE_MASK | RXE_START_MASK
+				| RXE_END_MASK,
+		.length = RXE_BTH_BYTES + RXE_RETH_BYTES,
+		.offset = {
+			[RXE_BTH]       = 0,
+			[RXE_RETH]      = RXE_BTH_BYTES,
+			[RXE_PAYLOAD]   = RXE_BTH_BYTES + RXE_RETH_BYTES,
+		}
+	},
 
 	/* UC */
 	[IB_OPCODE_UC_SEND_FIRST]			= {
diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.h b/drivers/infiniband/sw/rxe/rxe_opcode.h
index 8f9aaaf260f2..a470e9b0b884 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.h
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.h
@@ -20,6 +20,7 @@  enum rxe_wr_mask {
 	WR_READ_MASK			= BIT(3),
 	WR_WRITE_MASK			= BIT(4),
 	WR_LOCAL_OP_MASK		= BIT(5),
+	WR_ATOMIC_WRITE_MASK		= BIT(7),
 
 	WR_READ_OR_WRITE_MASK		= WR_READ_MASK | WR_WRITE_MASK,
 	WR_WRITE_OR_SEND_MASK		= WR_WRITE_MASK | WR_SEND_MASK,
@@ -81,6 +82,8 @@  enum rxe_hdr_mask {
 
 	RXE_LOOPBACK_MASK	= BIT(NUM_HDR_TYPES + 12),
 
+	RXE_ATOMIC_WRITE_MASK   = BIT(NUM_HDR_TYPES + 14),
+
 	RXE_READ_OR_ATOMIC_MASK	= (RXE_READ_MASK | RXE_ATOMIC_MASK),
 	RXE_WRITE_OR_SEND_MASK	= (RXE_WRITE_MASK | RXE_SEND_MASK),
 	RXE_READ_OR_WRITE_MASK	= (RXE_READ_MASK | RXE_WRITE_MASK),
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index f3eec350f95c..22a1c4bcfa60 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -135,7 +135,8 @@  static void free_rd_atomic_resources(struct rxe_qp *qp)
 
 void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
 {
-	if (res->type == RXE_ATOMIC_MASK) {
+	if (res->type == RXE_ATOMIC_MASK ||
+			res->type == RXE_ATOMIC_WRITE_MASK) {
 		kfree_skb(res->resp.skb);
 	} else if (res->type == RXE_READ_MASK) {
 		if (res->read.mr)
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 0c9d2af15f3d..203d8d19f84a 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -240,6 +240,10 @@  static int next_opcode_rc(struct rxe_qp *qp, u32 opcode, int fits)
 		else
 			return fits ? IB_OPCODE_RC_SEND_ONLY_WITH_INVALIDATE :
 				IB_OPCODE_RC_SEND_FIRST;
+
+	case IB_WR_RDMA_ATOMIC_WRITE:
+		return IB_OPCODE_RC_RDMA_ATOMIC_WRITE;
+
 	case IB_WR_REG_MR:
 	case IB_WR_LOCAL_INV:
 		return opcode;
@@ -485,6 +489,9 @@  static int finish_packet(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 		}
 	}
 
+	if (pkt->mask & RXE_ATOMIC_WRITE_MASK)
+		memcpy(payload_addr(pkt), &wqe->wr.wr.rdma.atomic_wr, paylen);
+
 	return 0;
 }
 
@@ -680,13 +687,13 @@  int rxe_requester(void *arg)
 	}
 
 	mask = rxe_opcode[opcode].mask;
-	if (unlikely(mask & RXE_READ_OR_ATOMIC_MASK)) {
+	if (unlikely(mask & (RXE_READ_OR_ATOMIC_MASK | RXE_ATOMIC_WRITE_MASK))) {
 		if (check_init_depth(qp, wqe))
 			goto exit;
 	}
 
 	mtu = get_mtu(qp);
-	payload = (mask & RXE_WRITE_OR_SEND_MASK) ? wqe->dma.resid : 0;
+	payload = (mask & (RXE_WRITE_OR_SEND_MASK | RXE_ATOMIC_WRITE_MASK)) ? wqe->dma.resid : 0;
 	if (payload > mtu) {
 		if (qp_type(qp) == IB_QPT_UD) {
 			/* C10-93.1.1: If the total sum of all the buffer lengths specified for a
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index e015860e8c34..cf678a3aaaa9 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -258,7 +258,7 @@  static enum resp_states check_op_valid(struct rxe_qp *qp,
 	case IB_QPT_RC:
 		if (((pkt->mask & RXE_READ_MASK) &&
 		     !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_READ)) ||
-		    ((pkt->mask & RXE_WRITE_MASK) &&
+		    ((pkt->mask & (RXE_WRITE_MASK | RXE_ATOMIC_WRITE_MASK)) &&
 		     !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_WRITE)) ||
 		    ((pkt->mask & RXE_ATOMIC_MASK) &&
 		     !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_ATOMIC))) {
@@ -362,7 +362,7 @@  static enum resp_states check_resource(struct rxe_qp *qp,
 		}
 	}
 
-	if (pkt->mask & RXE_READ_OR_ATOMIC_MASK) {
+	if (pkt->mask & (RXE_READ_OR_ATOMIC_MASK | RXE_ATOMIC_WRITE_MASK)) {
 		/* it is the requesters job to not send
 		 * too many read/atomic ops, we just
 		 * recycle the responder resource queue
@@ -413,7 +413,7 @@  static enum resp_states check_rkey(struct rxe_qp *qp,
 	enum resp_states state;
 	int access;
 
-	if (pkt->mask & RXE_READ_OR_WRITE_MASK) {
+	if (pkt->mask & (RXE_READ_OR_WRITE_MASK | RXE_ATOMIC_WRITE_MASK)) {
 		if (pkt->mask & RXE_RETH_MASK) {
 			qp->resp.va = reth_va(pkt);
 			qp->resp.offset = 0;
@@ -479,7 +479,7 @@  static enum resp_states check_rkey(struct rxe_qp *qp,
 		goto err;
 	}
 
-	if (pkt->mask & RXE_WRITE_MASK)	 {
+	if (pkt->mask & (RXE_WRITE_MASK | RXE_ATOMIC_WRITE_MASK)) {
 		if (resid > mtu) {
 			if (pktlen != mtu || bth_pad(pkt)) {
 				state = RESPST_ERR_LENGTH;
@@ -591,6 +591,26 @@  static enum resp_states process_atomic(struct rxe_qp *qp,
 	return ret;
 }
 
+static enum resp_states process_atomic_write(struct rxe_qp *qp,
+					     struct rxe_pkt_info *pkt)
+{
+	struct rxe_mr *mr = qp->resp.mr;
+
+	u64 *src = payload_addr(pkt);
+
+	u64 *dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
+	if (!dst || (uintptr_t)dst & 7)
+		return RESPST_ERR_MISALIGNED_ATOMIC;
+
+	/* Do atomic write after all prior operations have completed */
+	smp_store_release(dst, *src);
+
+	/* decrease resp.resid to zero */
+	qp->resp.resid -= sizeof(u64);
+
+	return RESPST_NONE;
+}
+
 static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
 					  struct rxe_pkt_info *pkt,
 					  struct rxe_pkt_info *ack,
@@ -801,6 +821,10 @@  static enum resp_states execute(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
 		err = process_atomic(qp, pkt);
 		if (err)
 			return err;
+	} else if (pkt->mask & RXE_ATOMIC_WRITE_MASK) {
+		err = process_atomic_write(qp, pkt);
+		if (err)
+			return err;
 	} else {
 		/* Unreachable */
 		WARN_ON_ONCE(1);
@@ -965,9 +989,12 @@  static int send_resp(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 	struct sk_buff *skb;
 	struct resp_res *res;
 
+	int opcode = pkt->mask & RXE_ATOMIC_MASK ?
+				IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE :
+				IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY;
+
 	skb = prepare_ack_packet(qp, pkt, &ack_pkt,
-				 IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE, 0, pkt->psn,
-				 syndrome);
+				 opcode, 0, pkt->psn, syndrome);
 	if (!skb) {
 		rc = -ENOMEM;
 		goto out;
@@ -978,7 +1005,7 @@  static int send_resp(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 	rxe_advance_resp_resource(qp);
 
 	skb_get(skb);
-	res->type = RXE_ATOMIC_MASK;
+	res->type = pkt->mask & (RXE_ATOMIC_MASK | RXE_ATOMIC_WRITE_MASK);
 	res->resp.skb = skb;
 	res->first_psn = ack_pkt.psn;
 	res->last_psn  = ack_pkt.psn;
@@ -1001,7 +1028,7 @@  static enum resp_states acknowledge(struct rxe_qp *qp,
 
 	if (qp->resp.aeth_syndrome != AETH_ACK_UNLIMITED)
 		send_ack(qp, pkt, qp->resp.aeth_syndrome, pkt->psn);
-	else if (pkt->mask & RXE_ATOMIC_MASK)
+	else if (pkt->mask & (RXE_ATOMIC_MASK | RXE_ATOMIC_WRITE_MASK))
 		send_resp(qp, pkt, AETH_ACK_UNLIMITED);
 	else if (bth_ack(pkt))
 		send_ack(qp, pkt, AETH_ACK_UNLIMITED, pkt->psn);
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index a9162f25beaf..519ec6b841e7 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -84,6 +84,7 @@  enum {
 	/* opcode 0x15 is reserved */
 	IB_OPCODE_SEND_LAST_WITH_INVALIDATE         = 0x16,
 	IB_OPCODE_SEND_ONLY_WITH_INVALIDATE         = 0x17,
+	IB_OPCODE_RDMA_ATOMIC_WRITE                 = 0x1D,
 
 	/* real constants follow -- see comment about above IB_OPCODE()
 	   macro for more details */
@@ -112,6 +113,7 @@  enum {
 	IB_OPCODE(RC, FETCH_ADD),
 	IB_OPCODE(RC, SEND_LAST_WITH_INVALIDATE),
 	IB_OPCODE(RC, SEND_ONLY_WITH_INVALIDATE),
+	IB_OPCODE(RC, RDMA_ATOMIC_WRITE),
 
 	/* UC */
 	IB_OPCODE(UC, SEND_FIRST),
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6e9ad656ecb7..949b3586d35b 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -971,6 +971,7 @@  enum ib_wc_opcode {
 	IB_WC_REG_MR,
 	IB_WC_MASKED_COMP_SWAP,
 	IB_WC_MASKED_FETCH_ADD,
+	IB_WC_RDMA_ATOMIC_WRITE = IB_UVERBS_WC_RDMA_ATOMIC_WRITE,
 /*
  * Set value of IB_WC_RECV so consumers can test if a completion is a
  * receive by testing (opcode & IB_WC_RECV).
@@ -1311,6 +1312,7 @@  enum ib_wr_opcode {
 		IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP,
 	IB_WR_MASKED_ATOMIC_FETCH_AND_ADD =
 		IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+	IB_WR_RDMA_ATOMIC_WRITE = IB_UVERBS_WR_RDMA_ATOMIC_WRITE,
 
 	/* These are kernel only and can not be issued by userspace */
 	IB_WR_REG_MR = 0x20,
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 7ee73a0652f1..3b0b509fb96f 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -466,6 +466,7 @@  enum ib_uverbs_wc_opcode {
 	IB_UVERBS_WC_BIND_MW = 5,
 	IB_UVERBS_WC_LOCAL_INV = 6,
 	IB_UVERBS_WC_TSO = 7,
+	IB_UVERBS_WC_RDMA_ATOMIC_WRITE = 9,
 };
 
 struct ib_uverbs_wc {
@@ -784,6 +785,7 @@  enum ib_uverbs_wr_opcode {
 	IB_UVERBS_WR_RDMA_READ_WITH_INV = 11,
 	IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP = 12,
 	IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD = 13,
+	IB_UVERBS_WR_RDMA_ATOMIC_WRITE = 15,
 	/* Review enum ib_wr_opcode before modifying this */
 };
 
diff --git a/include/uapi/rdma/rdma_user_rxe.h b/include/uapi/rdma/rdma_user_rxe.h
index f09c5c9e3dd5..7e02c614d826 100644
--- a/include/uapi/rdma/rdma_user_rxe.h
+++ b/include/uapi/rdma/rdma_user_rxe.h
@@ -86,6 +86,7 @@  struct rxe_send_wr {
 			__aligned_u64 remote_addr;
 			__u32	rkey;
 			__u32	reserved;
+			__aligned_u64 atomic_wr;
 		} rdma;
 		struct {
 			__aligned_u64 remote_addr;