[for-next,v2,2/4] IB/hfi1: Move rvt_cq_wc struct into uapi directory

Message ID 20181205151044.27132.1576.stgit@scvm10.sc.intel.com (mailing list archive)
State Changes Requested
Delegated to: Jason Gunthorpe
Series IB/hfi1: Clean up and code improvements

Commit Message

Dennis Dalessandro Dec. 5, 2018, 3:10 p.m. UTC
From: Kamenee Arumugam <kamenee.arumugam@intel.com>

The rvt_cq_wc struct elements are shared between rdmavt
and the providers but are not in the uapi directory.
As noted in the comment at
https://marc.info/?l=linux-rdma&m=152296522708522&w=2
the hfi1 driver and the rdma core driver are not using
shared structures in the uapi directory.

Therefore, move the rvt_cq_wc struct into the rvt-abi.h
header file and create a rvt_k_cq_wc struct for the kernel
completion queue.

Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/qp.c   |   32 ++++++
 drivers/infiniband/sw/rdmavt/cq.c |  189 ++++++++++++++++++++++++-------------
 include/rdma/rdmavt_cq.h          |   10 +-
 include/uapi/rdma/rvt-abi.h       |   31 ++++++
 4 files changed, 186 insertions(+), 76 deletions(-)
 create mode 100644 include/uapi/rdma/rvt-abi.h

Comments

Jason Gunthorpe Dec. 7, 2018, 4:07 a.m. UTC | #1
On Wed, Dec 05, 2018 at 07:10:52AM -0800, Dennis Dalessandro wrote:
> +#include <linux/types.h>
> +#include <rdma/ib_user_verbs.h>
> +#ifndef RDMA_ATOMIC_UAPI
> +#define RDMA_ATOMIC_UAPI(_type, _name) _type _name
> +#endif
> +/*
> + * This structure is used to contain the head pointer, tail pointer,
> + * and completion queue entries as a single memory allocation so
> + * it can be mmap'ed into user space.
> + */
> +struct rvt_cq_wc {
> +	/* index of next entry to fill */
> +	RDMA_ATOMIC_UAPI(u32, head);
> +	/* index of next ib_poll_cq() entry */
> +	RDMA_ATOMIC_UAPI(u32, tail);

I went and looked at all the readers of these variables, and they are
all missing a READ_ONCE (ie the kernel version of
atomic_load(relaxed))

I think you should change the definition to be:

#define RDMA_ATOMIC_UAPI(_type, _name) struct {_type atomic_val;} _name

And then add a kernel function

#define RDMA_READ_UAPI_ATOMIC(_member) READ_ONCE((_member)->atomic_val)

So that the compiler protects against this mistake.

Jason
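
To make the wrapper idea concrete, a minimal sketch of what the suggested macros buy; read_cq_tail() is a hypothetical helper, shown only to illustrate the compile-time protection:

#define RDMA_ATOMIC_UAPI(_type, _name) struct { _type atomic_val; } _name
#define RDMA_READ_UAPI_ATOMIC(_member) READ_ONCE((_member)->atomic_val)

struct rvt_cq_wc {
	/* index of next entry to fill */
	RDMA_ATOMIC_UAPI(__u32, head);
	/* index of next ib_poll_cq() entry */
	RDMA_ATOMIC_UAPI(__u32, tail);
	/* these are actually size ibcq.cqe + 1 */
	struct ib_uverbs_wc uqueue[];
};

static u32 read_cq_tail(struct rvt_cq_wc *wc)
{
	/* u32 t = wc->tail; no longer compiles: tail is a one-member struct */
	return RDMA_READ_UAPI_ATOMIC(&wc->tail);	/* expands to READ_ONCE() */
}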
Arumugam, Kamenee Dec. 7, 2018, 8:26 p.m. UTC | #2
Subject: Re: [PATCH for-next v2 2/4] IB/hfi1: Move rvt_cq_wc struct into uapi directory

On Wed, Dec 05, 2018 at 07:10:52AM -0800, Dennis Dalessandro wrote:
> +#include <linux/types.h>
> +#include <rdma/ib_user_verbs.h>
> +#ifndef RDMA_ATOMIC_UAPI
> +#define RDMA_ATOMIC_UAPI(_type, _name) _type _name
> +#endif
> +/*
> + * This structure is used to contain the head pointer, tail pointer,
> + * and completion queue entries as a single memory allocation so
> + * it can be mmap'ed into user space.
> + */
> +struct rvt_cq_wc {
> +	/* index of next entry to fill */
> +	RDMA_ATOMIC_UAPI(u32, head);
> +	/* index of next ib_poll_cq() entry */
> +	RDMA_ATOMIC_UAPI(u32, tail);

> I went and looked at all the readers of these variables, and they are all missing a READ_ONCE (ie the kernel version of
> atomic_load(relaxed))

> I think you should change the definition to be:

> #define RDMA_ATOMIC_UAPI(_type, _name) struct {_type atomic_val;} _name

> And then add a kernel function

> #define RDMA_READ_UAPI_ATOMIC(_member) READ_ONCE((_member)->atomic_val)

> So that the compiler protects against this mistake.

There is already a producer and consumer lock for the circular buffer and only reading the opposing index needs the READ_ONCE control. For example, whenever the producer fills the buffer, the producer lock is held and READ_ONCE is used for the tail variable. We don't need READ_ONCE for the head variable because it is already protected by the locking scheme; the additional memory barrier and volatile cast would only slow execution and defeat compiler optimization here. We took the design model from:

https://www.kernel.org/doc/Documentation/circular-buffers.txt

Therefore, adding this kernel function is not needed when the index being read is protected by the spin lock. Only the peer index requires READ_ONCE.

We are going to make sure READ_ONCE is in place in exactly those code paths that need it.
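
The locked-producer pattern being described, condensed from circular-buffers.txt onto the kernel-side queue of this series, would look roughly like this (a sketch with a hypothetical helper name, not the patch code):

static void kqueue_enter_sketch(struct rvt_cq *cq, struct ib_wc *entry)
{
	struct rvt_k_cq_wc *k_wc = cq->kqueue;
	unsigned long flags;
	u32 head, tail, next;

	spin_lock_irqsave(&cq->lock, flags);
	head = k_wc->head;		/* our own index, protected by the lock */
	tail = READ_ONCE(k_wc->tail);	/* peer index, the only READ_ONCE needed */
	next = (head >= (u32)cq->ibcq.cqe) ? 0 : head + 1;
	if (next != tail) {		/* ring is not full */
		k_wc->kqueue[head] = *entry;
		smp_wmb();		/* make the entry visible before the index */
		k_wc->head = next;
	}
	spin_unlock_irqrestore(&cq->lock, flags);
}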
Jason Gunthorpe Dec. 7, 2018, 8:32 p.m. UTC | #3
On Fri, Dec 07, 2018 at 08:26:10PM +0000, Arumugam, Kamenee wrote:
> Subject: Re: [PATCH for-next v2 2/4] IB/hfi1: Move rvt_cq_wc struct into uapi directory
> 
> On Wed, Dec 05, 2018 at 07:10:52AM -0800, Dennis Dalessandro wrote:
> > +#include <linux/types.h>
> > +#include <rdma/ib_user_verbs.h>
> > +#ifndef RDMA_ATOMIC_UAPI
> > +#define RDMA_ATOMIC_UAPI(_type, _name) _type _name
> > +#endif
> > +/*
> > + * This structure is used to contain the head pointer, tail pointer,
> > + * and completion queue entries as a single memory allocation so
> > + * it can be mmap'ed into user space.
> > + */
> > +struct rvt_cq_wc {
> > +	/* index of next entry to fill */
> > +	RDMA_ATOMIC_UAPI(u32, head);
> > +	/* index of next ib_poll_cq() entry */
> > +	RDMA_ATOMIC_UAPI(u32, tail);
> 
> > I went and looked at all the readers of these variables, and they are all missing a READ_ONCE (ie the kernel version of
> > atomic_load(relaxed))
> 
> > I think you should change the definition to be:
> 
> > #define RDMA_ATOMIC_UAPI(_type, _name) struct {_type atomic_val;} _name
> 
> > And then add a kernel function
> 
> > #define RDMA_READ_UAPI_ATOMIC(_member) READ_ONCE((_member)->atomic_val)
> 
> > So that the compiler protects against this mistake.
> 
> There is already a producer and consumer lock for the circular
> buffer and only reading the opposing index needs the READ_ONCE
> control. 

No. 

Any read of memory mmap'd into user space must use READ_ONCE or risk a
security problem due to control-flow execution corruption.

Locks don't extend to untrusted userspace so they are useless for
this.

> https://www.kernel.org/doc/Documentation/circular-buffers.txt

This document is not considering the impact of mmaping the memory to
user space.

If you want to avoid the READ_ONCE then you have to maintain a shadow
value in secure kernel memory.

Same story for WRITE_ONCE.

This is why my suggestion to force the compiler to check this is
required to get it right.

Jason
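
The shadow-value alternative mentioned above would look roughly like this against the struct as posted in this patch (plain u32 indices); the shadow_head field and the helper are hypothetical, not part of the series:

struct example_cq {
	spinlock_t lock;
	u32 shadow_head;		/* authoritative copy, kernel memory only */
	struct rvt_cq_wc *queue;	/* mmap'ed into user space, untrusted */
};

static void example_advance_head(struct example_cq *cq, u32 next)
{
	/* caller holds cq->lock; the kernel only ever reads shadow_head */
	cq->shadow_head = next;
	/* publish a copy for user space, never read queue->head back */
	WRITE_ONCE(cq->queue->head, next);
}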
Arumugam, Kamenee Feb. 1, 2019, 8:21 p.m. UTC | #4
> > I think you should change the definition to be:
> 
> > #define RDMA_ATOMIC_UAPI(_type, _name) struct {_type atomic_val;} 
> > _name
> 
> > And then add a kernel function
> 
> > #define RDMA_READ_UAPI_ATOMIC(_member) 
> > READ_ONCE((_member)->atomic_val)
> 
> > So that the compiler protects against this mistake.
> 
> There is already a producer and consumer lock for the circular buffer 
> and only reading the opposing index needs the READ_ONCE control.

> No. 
> Any read of memory mmap'd into user space must use READ_ONCE or risk a security problem due to control-flow execution corruption.

As per your suggestion above, I have drafted a quick prototype to understand and clear up any assumptions beforehand.  Below are the 3 macros that will be introduced
in the rvt-abi header file.  The RDMA_READ_UAPI_ATOMIC and RDMA_WRITE_UAPI_ATOMIC macros are only applied to structs mmapped to userspace.

#define RDMA_ATOMIC_UAPI(_type, _name) struct { _type val; } _name
#define RDMA_READ_UAPI_ATOMIC(member) READ_ONCE((member).val)
#define RDMA_WRITE_UAPI_ATOMIC(member, x) WRITE_ONCE((member).val, x)
struct rvt_cq_wc {
	/* index of next entry to fill */
	RDMA_ATOMIC_UAPI(__u32, head);
	/* index of next ib_poll_cq() entry */
	RDMA_ATOMIC_UAPI(__u32, tail);

	/* these are actually size ibcq.cqe + 1 */
	struct ib_uverbs_wc uqueue[];
};

Here is the use case of the RDMA_READ_UAPI_ATOMIC macro in cq.c.  READ_ONCE will be applied for any read of the indices mmapped to user space.
if (cq->ip) {
		u_wc = cq->queue;
		uqueue = &u_wc->uqueue[0];
		head = RDMA_READ_UAPI_ATOMIC(u_wc->head);
		tail = RDMA_READ_UAPI_ATOMIC(u_wc->tail);
	} else {
		k_wc = cq->kqueue;
		kqueue = &k_wc->kqueue[0];
		head = k_wc->head;
		tail = k_wc->tail;
	}


>Same story for WRITE_ONCE.

Here is the usage for RDMA_WRITE_UAPI_ATOMIC macro. For cq enter, we kept smp_wmb() in the code as it is now and we didn't adopt it into this macro. This macro
is meant as a compiler barrier.

cq->ibcq.cqe = cqe;
	if (u_wc) {
		RDMA_WRITE_UAPI_ATOMIC(u_wc->head, n);
		RDMA_WRITE_UAPI_ATOMIC(u_wc->tail, 0);
		cq->queue = u_wc;
	} else {
		k_wc->head = n;
		k_wc->tail = 0;
		cq->kqueue = k_wc;
	}
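
For the enter path, keeping smp_wmb() as it is now would look roughly like this with the prototype macros (a sketch of the approach described above; the reply below argues for a release store instead):

	uqueue[head].dlid_path_bits = entry->dlid_path_bits;
	uqueue[head].port_num = entry->port_num;
	/* Make sure entry is written before the head index. */
	smp_wmb();
	RDMA_WRITE_UAPI_ATOMIC(u_wc->head, next);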
Jason Gunthorpe Feb. 1, 2019, 8:42 p.m. UTC | #5
On Fri, Feb 01, 2019 at 08:21:35PM +0000, Arumugam, Kamenee wrote:
> 
> > > I think you should change the definition to be:
> > 
> > > #define RDMA_ATOMIC_UAPI(_type, _name) struct {_type atomic_val;} 
> > > _name
> > 
> > > And then add a kernel function
> > 
> > > #define RDMA_READ_UAPI_ATOMIC(_member) 
> > > READ_ONCE((_member)->atomic_val)
> > 
> > > So that the compiler protects against this mistake.
> > 
> > There is already a producer and consumer lock for the circular buffer 
> > and only reading the opposing index needs the READ_ONCE control.
> 
> > No. 
> > Any read of memory mmap'd into user space must use READ_ONCE or risk a security problem due to control-flow execution corruption.
> 
> As per your suggestion above, I have drafted a quick prototype to
> understand and clear up any assumptions beforehand.  Below are the 3
> macros that will be introduced in the rvt-abi header file.  The
> RDMA_READ_UAPI_ATOMIC and RDMA_WRITE_UAPI_ATOMIC macros are only
> applied to structs mmapped to userspace.
> 
> #define RDMA_ATOMIC_UAPI(_type, _name) struct { _type val; } _name
> #define RDMA_READ_UAPI_ATOMIC(member) READ_ONCE((member).val)
> #define RDMA_WRITE_UAPI_ATOMIC(member, x) WRITE_ONCE((member).val, x)

It seems like siw (and rxe) needs this too; let's see if Bernard likes it, and if
so it should probably go in a commonish header.

> struct rvt_cq_wc {
> 	/* index of next entry to fill */
> 	RDMA_ATOMIC_UAPI(__u32, head);
> 	/* index of next ib_poll_cq() entry */
> 	RDMA_ATOMIC_UAPI(__u32, tail);
> 
> 	/* these are actually size ibcq.cqe + 1 */
> 	struct ib_uverbs_wc uqueue[];
> };
> 
> Here is the use case of the RDMA_READ_UAPI_ATOMIC macro in cq.c.
> READ_ONCE will be applied for any read of the indices mmapped to
> user space.

> if (cq->ip) {
> 		u_wc = cq->queue;
> 		uqueue = &u_wc->uqueue[0];
> 		head = RDMA_READ_UAPI_ATOMIC(u_wc->head);
> 		tail = RDMA_READ_UAPI_ATOMIC(u_wc->tail);
> 	} else {
> 		k_wc = cq->kqueue;
> 		kqueue = &k_wc->kqueue[0];
> 		head = k_wc->head;
> 		tail = k_wc->tail;
> 	}
> 
> 
> >Same story for WRITE_ONCE.
> 
> Here is the usage for RDMA_WRITE_UAPI_ATOMIC macro. For cq enter, we
> kept smp_wmb() in the code as it is now and we didn't adopt it into
> this macro. This macro is meant as a compiler barrier.

That should be using smp_store_release()/load_acquire(), not open
coding barriers.

When working with unlocked data like this it is important to be
careful with these things. READ/WRITE_ONCE are only correct if the
single value is required, if you are doing anything that has a
dependency on other values then you have to use acquire/release
semantics to ensure global ordering is achieved.

These mirror to the userspace stdatomic definitions, so userspace must
also use the proper pairs - ie a smp_load_acquire() in kernel must be
paired with an atomic_store_explicit(memory_order_release) - and vice
versa

Similar for the READ/WRITE_ONCE but it should use
memory_order_relaxed.

Bernard, same expectation for siw.

Jason
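
Concretely, the pairing being asked for would look something like this (a sketch; the kernel side uses the .val member from the prototype above, it assumes user space builds the same header with RDMA_ATOMIC_UAPI defined to a C11 _Atomic field, and process_wc() is a hypothetical helper):

/* Kernel producer (rvt_cq_enter(), under cq->lock): fill the entry,
 * then release-store the index so the entry is ordered before it.
 */
	/* ... fill u_wc->uqueue[head] field by field, as in the patch ... */
	smp_store_release(&u_wc->head.val, next);	/* instead of smp_wmb() + plain store */

/* User-space consumer (provider library, C11 stdatomic): the acquire
 * load of head pairs with the release store above, so the new entry is
 * guaranteed to be visible before it is read.
 */
	head = atomic_load_explicit(&u_wc->head, memory_order_acquire);
	while (tail != head) {
		process_wc(&u_wc->uqueue[tail]);
		tail = (tail == cqe) ? 0 : tail + 1;
	}
	atomic_store_explicit(&u_wc->tail, tail, memory_order_release);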
Arumugam, Kamenee Feb. 5, 2019, 7:22 p.m. UTC | #6
>> Here is the usage for RDMA_WRITE_UAPI_ATOMIC macro. For cq enter, we 
>> kept smp_wmb() in the code as it is now and we didn't adopt it into 
> >this macro. This macro is meant as a compiler barrier.

> That should be using smp_store_release()/load_acquire(), not open 
> coding barriers.
> 

We are concerned that the barrier() in __smp_store_release() isn't strong enough on all platforms.

> When working with unlocked data like this it is important to be 
> careful with these things. READ/WRITE_ONCE are only correct if the 
> single value is required; if you are doing anything that has a 
> dependency on other values then you have to use acquire/release 
> semantics to ensure global ordering is achieved.
> 

The smp_store_release() has a compile time assert that prevents the tearing.

I assume that is why you are suggesting the smp_store_release().

> These mirror to the userspace stdatomic definitions, so userspace must 
> also use the proper pairs - ie a smp_load_acquire() in kernel must be 
> paired with an atomic_store_explicit(memory_order_release) - and vice 
> versa
> 
> Similar for the READ/WRITE_ONCE but it should use 
> memory_order_relaxed.
> 

So are you now suggesting to use the smp_load_acquire() instead of READ_ONCE()?
Jason Gunthorpe Feb. 5, 2019, 7:41 p.m. UTC | #7
On Tue, Feb 05, 2019 at 07:22:50PM +0000, Arumugam, Kamenee wrote:
> >> Here is the usage for RDMA_WRITE_UAPI_ATOMIC macro. For cq enter, we 
> >> kept smp_wmb() in the code as it is now and we didn't adopt it into 
> > >this macro. This macro is meant as a compiler barrier.
> 
> > That should be using smp_store_release()/load_acquire(), not open 
> > coding barriers.
> > 
> We are concerned that the barrier() in __smp_store_release() isn't
> strong enough on all platforms.

The platform must provide barriers inside acquire/release that are
strong enough to meet the behavioral definition from the memory
model. 

Ie the acquire is guaranteed to see all stores executed by another CPU
prior to release.

I'm not sure what you mean by 'strong enough'

> > When working with unlocked data like this it is important to be 
> > careful with these things. READ/WRITE_ONCE are only correct if the 
> > single value is required; if you are doing anything that has a 
> > dependency on other values then you have to use acquire/release 
> > semantics to ensure global ordering is achieved.
>
> The smp_store_release() has a compile time assert that prevents the tearing.
>
> I assume that is why you are suggesting the smp_store_release().

No, not quite, there is a difference between reading the single value
and being able to observe any prior stores to other values.

> > These mirror to the userspace stdatomic definitions, so userspace must 
> > also use the proper pairs - ie a smp_load_acquire() in kernel must be 
> > paired with an atomic_store_explicit(memory_order_release) - and vice 
> > versa
> > 
> > Similar for the READ/WRITE_ONCE but it should use 
> > memory_order_relaxed.
> > 
> 
> So are you now suggesting to use the smp_load_acquire() instead of READ_ONCE()?

It depends entirely on what your algorithm is doing. If you can use
relaxed ordering then READ/WRITE_ONCE are OK. If you need to have
acquire/release semantics then you should use acquire/release.

An open coded barrier here is rarely a good idea.

Jason
Jason Gunthorpe Feb. 8, 2019, 4:32 p.m. UTC | #8
On Fri, Feb 08, 2019 at 05:17:39PM +0100, Bernard Metzler wrote:

>    I agree. In siw READ_ONCE/WRITE_ONCE is used, or the appropriate
>    macros, which add memory barriers before or after, as needed.
>    I currently clean up a few occurrences of explicit mb's.
>    Why do we need an extra macro like the proposed RDMA_WRITE_UAPI_ATOMIC?
>    Is it to catch cases where one wants to introduce non-atomic values
>    to be shared?

I think this was to get some kind of type safety so the compiler would
complain if people touch the atomics wrongly. Maybe we don't need it.

Jason

Patch

diff --git a/drivers/infiniband/hw/hfi1/qp.c b/drivers/infiniband/hw/hfi1/qp.c
index 5344e89..7e6d0a7 100644
--- a/drivers/infiniband/hw/hfi1/qp.c
+++ b/drivers/infiniband/hw/hfi1/qp.c
@@ -591,6 +591,34 @@  static int qp_idle(struct rvt_qp *qp)
 }
 
 /**
+ * ib_cq_tail - Return tail index of cq buffer
+ * @send_cq - The cq for send
+ *
+ * This is called in qp_iter_print to get tail
+ * of cq buffer.
+ */
+static u32 ib_cq_tail(struct ib_cq *send_cq)
+{
+	return ibcq_to_rvtcq(send_cq)->ip ?
+	       ibcq_to_rvtcq(send_cq)->queue->tail :
+	       ibcq_to_rvtcq(send_cq)->kqueue->tail;
+}
+
+/**
+ * ib_cq_head - Return head index of cq buffer
+ * @send_cq - The cq for send
+ *
+ * This is called in qp_iter_print to get head
+ * of cq buffer.
+ */
+static u32 ib_cq_head(struct ib_cq *send_cq)
+{
+	return ibcq_to_rvtcq(send_cq)->ip ?
+	       ibcq_to_rvtcq(send_cq)->queue->head :
+	       ibcq_to_rvtcq(send_cq)->kqueue->head;
+}
+
+/**
  * qp_iter_print - print the qp information to seq_file
  * @s: the seq_file to emit the qp information on
  * @iter: the iterator for the qp hash list
@@ -650,8 +678,8 @@  void qp_iter_print(struct seq_file *s, struct rvt_qp_iter *iter)
 		   sde ? sde->this_idx : 0,
 		   send_context,
 		   send_context ? send_context->sw_index : 0,
-		   ibcq_to_rvtcq(qp->ibqp.send_cq)->queue->head,
-		   ibcq_to_rvtcq(qp->ibqp.send_cq)->queue->tail,
+		   ib_cq_head(qp->ibqp.send_cq),
+		   ib_cq_tail(qp->ibqp.send_cq),
 		   qp->pid,
 		   qp->s_state,
 		   qp->s_ack_state,
diff --git a/drivers/infiniband/sw/rdmavt/cq.c b/drivers/infiniband/sw/rdmavt/cq.c
index 4f1544a..31c1d68 100644
--- a/drivers/infiniband/sw/rdmavt/cq.c
+++ b/drivers/infiniband/sw/rdmavt/cq.c
@@ -63,19 +63,33 @@ 
  */
 void rvt_cq_enter(struct rvt_cq *cq, struct ib_wc *entry, bool solicited)
 {
-	struct rvt_cq_wc *wc;
+	struct ib_uverbs_wc *uqueue = NULL;
+	struct ib_wc *kqueue = NULL;
+	struct rvt_cq_wc *u_wc = NULL;
+	struct rvt_k_cq_wc *k_wc = NULL;
 	unsigned long flags;
 	u32 head;
 	u32 next;
+	u32 tail;
 
 	spin_lock_irqsave(&cq->lock, flags);
 
+	if (cq->ip) {
+		u_wc = cq->queue;
+		uqueue = &u_wc->uqueue[0];
+		head = u_wc->head;
+		tail = u_wc->tail;
+	} else {
+		k_wc = cq->kqueue;
+		kqueue = &k_wc->kqueue[0];
+		head = k_wc->head;
+		tail = k_wc->tail;
+	}
+
 	/*
-	 * Note that the head pointer might be writable by user processes.
-	 * Take care to verify it is a sane value.
+	 * Note that the head pointer might be writable by
+	 * user processes.Take care to verify it is a sane value.
 	 */
-	wc = cq->queue;
-	head = wc->head;
 	if (head >= (unsigned)cq->ibcq.cqe) {
 		head = cq->ibcq.cqe;
 		next = 0;
@@ -83,7 +97,7 @@  void rvt_cq_enter(struct rvt_cq *cq, struct ib_wc *entry, bool solicited)
 		next = head + 1;
 	}
 
-	if (unlikely(next == wc->tail)) {
+	if (unlikely(next == tail)) {
 		spin_unlock_irqrestore(&cq->lock, flags);
 		if (cq->ibcq.event_handler) {
 			struct ib_event ev;
@@ -96,27 +110,28 @@  void rvt_cq_enter(struct rvt_cq *cq, struct ib_wc *entry, bool solicited)
 		return;
 	}
 	trace_rvt_cq_enter(cq, entry, head);
-	if (cq->ip) {
-		wc->uqueue[head].wr_id = entry->wr_id;
-		wc->uqueue[head].status = entry->status;
-		wc->uqueue[head].opcode = entry->opcode;
-		wc->uqueue[head].vendor_err = entry->vendor_err;
-		wc->uqueue[head].byte_len = entry->byte_len;
-		wc->uqueue[head].ex.imm_data = entry->ex.imm_data;
-		wc->uqueue[head].qp_num = entry->qp->qp_num;
-		wc->uqueue[head].src_qp = entry->src_qp;
-		wc->uqueue[head].wc_flags = entry->wc_flags;
-		wc->uqueue[head].pkey_index = entry->pkey_index;
-		wc->uqueue[head].slid = ib_lid_cpu16(entry->slid);
-		wc->uqueue[head].sl = entry->sl;
-		wc->uqueue[head].dlid_path_bits = entry->dlid_path_bits;
-		wc->uqueue[head].port_num = entry->port_num;
+	if (uqueue) {
+		uqueue[head].wr_id = entry->wr_id;
+		uqueue[head].status = entry->status;
+		uqueue[head].opcode = entry->opcode;
+		uqueue[head].vendor_err = entry->vendor_err;
+		uqueue[head].byte_len = entry->byte_len;
+		uqueue[head].ex.imm_data = entry->ex.imm_data;
+		uqueue[head].qp_num = entry->qp->qp_num;
+		uqueue[head].src_qp = entry->src_qp;
+		uqueue[head].wc_flags = entry->wc_flags;
+		uqueue[head].pkey_index = entry->pkey_index;
+		uqueue[head].slid = ib_lid_cpu16(entry->slid);
+		uqueue[head].sl = entry->sl;
+		uqueue[head].dlid_path_bits = entry->dlid_path_bits;
+		uqueue[head].port_num = entry->port_num;
 		/* Make sure entry is written before the head index. */
 		smp_wmb();
+		u_wc->head = next;
 	} else {
-		wc->kqueue[head] = *entry;
+		kqueue[head] = *entry;
+		k_wc->head = next;
 	}
-	wc->head = next;
 
 	if (cq->notify == IB_CQ_NEXT_COMP ||
 	    (cq->notify == IB_CQ_SOLICITED &&
@@ -183,7 +198,8 @@  struct ib_cq *rvt_create_cq(struct ib_device *ibdev,
 {
 	struct rvt_dev_info *rdi = ib_to_rvt(ibdev);
 	struct rvt_cq *cq;
-	struct rvt_cq_wc *wc;
+	struct rvt_cq_wc *u_wc = NULL;
+	struct rvt_k_cq_wc *k_wc = NULL;
 	struct ib_cq *ret;
 	u32 sz;
 	unsigned int entries = attr->cqe;
@@ -212,17 +228,22 @@  struct ib_cq *rvt_create_cq(struct ib_device *ibdev,
 	 * We need to use vmalloc() in order to support mmap and large
 	 * numbers of entries.
 	 */
-	sz = sizeof(*wc);
-	if (udata && udata->outlen >= sizeof(__u64))
-		sz += sizeof(struct ib_uverbs_wc) * (entries + 1);
-	else
-		sz += sizeof(struct ib_wc) * (entries + 1);
-	wc = udata ?
-		vmalloc_user(sz) :
-		vzalloc_node(sz, rdi->dparms.node);
-	if (!wc) {
-		ret = ERR_PTR(-ENOMEM);
-		goto bail_cq;
+	if (udata && udata->outlen >= sizeof(__u64)) {
+		sz = sizeof(struct ib_uverbs_wc) * (entries + 1);
+		sz += sizeof(*u_wc);
+		u_wc = vmalloc_user(sz);
+		if (!u_wc) {
+			ret = ERR_PTR(-ENOMEM);
+			goto bail_cq;
+		}
+	} else {
+		sz = sizeof(struct ib_wc) * (entries + 1);
+		sz += sizeof(*k_wc);
+		k_wc = vzalloc_node(sz, rdi->dparms.node);
+		if (!k_wc) {
+			ret = ERR_PTR(-ENOMEM);
+			goto bail_cq;
+		}
 	}
 
 	/*
@@ -232,7 +253,7 @@  struct ib_cq *rvt_create_cq(struct ib_device *ibdev,
 	if (udata && udata->outlen >= sizeof(__u64)) {
 		int err;
 
-		cq->ip = rvt_create_mmap_info(rdi, sz, context, wc);
+		cq->ip = rvt_create_mmap_info(rdi, sz, context, u_wc);
 		if (!cq->ip) {
 			ret = ERR_PTR(-ENOMEM);
 			goto bail_wc;
@@ -279,7 +300,10 @@  struct ib_cq *rvt_create_cq(struct ib_device *ibdev,
 	cq->notify = RVT_CQ_NONE;
 	spin_lock_init(&cq->lock);
 	INIT_WORK(&cq->comptask, send_complete);
-	cq->queue = wc;
+	if (u_wc)
+		cq->queue = u_wc;
+	else
+		cq->kqueue = k_wc;
 
 	ret = &cq->ibcq;
 
@@ -289,7 +313,8 @@  struct ib_cq *rvt_create_cq(struct ib_device *ibdev,
 bail_ip:
 	kfree(cq->ip);
 bail_wc:
-	vfree(wc);
+	vfree(u_wc);
+	vfree(k_wc);
 bail_cq:
 	kfree(cq);
 done:
@@ -346,9 +371,15 @@  int rvt_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags)
 	if (cq->notify != IB_CQ_NEXT_COMP)
 		cq->notify = notify_flags & IB_CQ_SOLICITED_MASK;
 
-	if ((notify_flags & IB_CQ_REPORT_MISSED_EVENTS) &&
-	    cq->queue->head != cq->queue->tail)
-		ret = 1;
+	if (notify_flags & IB_CQ_REPORT_MISSED_EVENTS) {
+		if (cq->queue) {
+			if (cq->queue->head != cq->queue->tail)
+				ret = 1;
+		} else {
+			if (cq->kqueue->head != cq->kqueue->tail)
+				ret = 1;
+		}
+	}
 
 	spin_unlock_irqrestore(&cq->lock, flags);
 
@@ -364,12 +395,14 @@  int rvt_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags)
 int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 {
 	struct rvt_cq *cq = ibcq_to_rvtcq(ibcq);
-	struct rvt_cq_wc *old_wc;
-	struct rvt_cq_wc *wc;
 	u32 head, tail, n;
 	int ret;
 	u32 sz;
 	struct rvt_dev_info *rdi = cq->rdi;
+	struct rvt_cq_wc *u_wc = NULL;
+	struct rvt_cq_wc *old_u_wc = NULL;
+	struct rvt_k_cq_wc *k_wc = NULL;
+	struct rvt_k_cq_wc *old_k_wc = NULL;
 
 	if (cqe < 1 || cqe > rdi->dparms.props.max_cqe)
 		return -EINVAL;
@@ -377,17 +410,19 @@  int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 	/*
 	 * Need to use vmalloc() if we want to support large #s of entries.
 	 */
-	sz = sizeof(*wc);
-	if (udata && udata->outlen >= sizeof(__u64))
-		sz += sizeof(struct ib_uverbs_wc) * (cqe + 1);
-	else
-		sz += sizeof(struct ib_wc) * (cqe + 1);
-	wc = udata ?
-		vmalloc_user(sz) :
-		vzalloc_node(sz, rdi->dparms.node);
-	if (!wc)
-		return -ENOMEM;
-
+	if (udata && udata->outlen >= sizeof(__u64)) {
+		sz = sizeof(struct ib_uverbs_wc) * (cqe + 1);
+		sz += sizeof(*u_wc);
+		u_wc = vmalloc_user(sz);
+		if (!u_wc)
+			return -ENOMEM;
+	} else {
+		sz = sizeof(struct ib_wc) * (cqe + 1);
+		sz += sizeof(*k_wc);
+		k_wc = vzalloc_node(sz, rdi->dparms.node);
+		if (!k_wc)
+			return -ENOMEM;
+	}
 	/* Check that we can write the offset to mmap. */
 	if (udata && udata->outlen >= sizeof(__u64)) {
 		__u64 offset = 0;
@@ -402,11 +437,18 @@  int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 	 * Make sure head and tail are sane since they
 	 * might be user writable.
 	 */
-	old_wc = cq->queue;
-	head = old_wc->head;
+	if (u_wc) {
+		old_u_wc = cq->queue;
+		head = old_u_wc->head;
+		tail = old_u_wc->tail;
+	} else {
+		old_k_wc = cq->kqueue;
+		head = old_k_wc->head;
+		tail = old_k_wc->tail;
+	}
+
 	if (head > (u32)cq->ibcq.cqe)
 		head = (u32)cq->ibcq.cqe;
-	tail = old_wc->tail;
 	if (tail > (u32)cq->ibcq.cqe)
 		tail = (u32)cq->ibcq.cqe;
 	if (head < tail)
@@ -418,27 +460,36 @@  int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 		goto bail_unlock;
 	}
 	for (n = 0; tail != head; n++) {
-		if (cq->ip)
-			wc->uqueue[n] = old_wc->uqueue[tail];
+		if (u_wc)
+			u_wc->uqueue[n] = old_u_wc->uqueue[tail];
 		else
-			wc->kqueue[n] = old_wc->kqueue[tail];
+			k_wc->kqueue[n] = old_k_wc->kqueue[tail];
 		if (tail == (u32)cq->ibcq.cqe)
 			tail = 0;
 		else
 			tail++;
 	}
 	cq->ibcq.cqe = cqe;
-	wc->head = n;
-	wc->tail = 0;
-	cq->queue = wc;
+	if (u_wc) {
+		u_wc->head = n;
+		u_wc->tail = 0;
+		cq->queue = u_wc;
+	} else {
+		k_wc->head = n;
+		k_wc->tail = 0;
+		cq->kqueue = k_wc;
+	}
 	spin_unlock_irq(&cq->lock);
 
-	vfree(old_wc);
+	if (u_wc)
+		vfree(old_u_wc);
+	else
+		vfree(old_k_wc);
 
 	if (cq->ip) {
 		struct rvt_mmap_info *ip = cq->ip;
 
-		rvt_update_mmap_info(rdi, ip, sz, wc);
+		rvt_update_mmap_info(rdi, ip, sz, u_wc);
 
 		/*
 		 * Return the offset to mmap.
@@ -462,7 +513,9 @@  int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 bail_unlock:
 	spin_unlock_irq(&cq->lock);
 bail_free:
-	vfree(wc);
+	vfree(u_wc);
+	vfree(k_wc);
+
 	return ret;
 }
 
@@ -480,7 +533,7 @@  int rvt_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 int rvt_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry)
 {
 	struct rvt_cq *cq = ibcq_to_rvtcq(ibcq);
-	struct rvt_cq_wc *wc;
+	struct rvt_k_cq_wc *wc;
 	unsigned long flags;
 	int npolled;
 	u32 tail;
@@ -491,7 +544,7 @@  int rvt_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry)
 
 	spin_lock_irqsave(&cq->lock, flags);
 
-	wc = cq->queue;
+	wc = cq->kqueue;
 	tail = wc->tail;
 	if (tail > (u32)cq->ibcq.cqe)
 		tail = (u32)cq->ibcq.cqe;
diff --git a/include/rdma/rdmavt_cq.h b/include/rdma/rdmavt_cq.h
index 75dc65c..9f25be0 100644
--- a/include/rdma/rdmavt_cq.h
+++ b/include/rdma/rdmavt_cq.h
@@ -59,20 +59,17 @@ 
  * notifications are armed.
  */
 #define RVT_CQ_NONE      (IB_CQ_NEXT_COMP + 1)
+#include <rdma/rvt-abi.h>
 
 /*
  * This structure is used to contain the head pointer, tail pointer,
  * and completion queue entries as a single memory allocation so
  * it can be mmap'ed into user space.
  */
-struct rvt_cq_wc {
+struct rvt_k_cq_wc {
 	u32 head;               /* index of next entry to fill */
 	u32 tail;               /* index of next ib_poll_cq() entry */
-	union {
-		/* these are actually size ibcq.cqe + 1 */
-		struct ib_uverbs_wc uqueue[0];
-		struct ib_wc kqueue[0];
-	};
+	struct ib_wc kqueue[0];
 };
 
 /*
@@ -88,6 +85,7 @@  struct rvt_cq {
 	struct rvt_dev_info *rdi;
 	struct rvt_cq_wc *queue;
 	struct rvt_mmap_info *ip;
+	struct rvt_k_cq_wc *kqueue;
 };
 
 static inline struct rvt_cq *ibcq_to_rvtcq(struct ib_cq *ibcq)
diff --git a/include/uapi/rdma/rvt-abi.h b/include/uapi/rdma/rvt-abi.h
new file mode 100644
index 0000000..c9ad6fa
--- /dev/null
+++ b/include/uapi/rdma/rvt-abi.h
@@ -0,0 +1,31 @@ 
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+
+/*
+ * This file contains defines, structures, etc. that are used
+ * to communicate between kernel and user code.
+ */
+
+#ifndef RVT_ABI_USER_H
+#define RVT_ABI_USER_H
+
+#include <linux/types.h>
+#include <rdma/ib_user_verbs.h>
+#ifndef RDMA_ATOMIC_UAPI
+#define RDMA_ATOMIC_UAPI(_type, _name) _type _name
+#endif
+/*
+ * This structure is used to contain the head pointer, tail pointer,
+ * and completion queue entries as a single memory allocation so
+ * it can be mmap'ed into user space.
+ */
+struct rvt_cq_wc {
+	/* index of next entry to fill */
+	RDMA_ATOMIC_UAPI(u32, head);
+	/* index of next ib_poll_cq() entry */
+	RDMA_ATOMIC_UAPI(u32, tail);
+
+	/* these are actually size ibcq.cqe + 1 */
+	struct ib_uverbs_wc uqueue[0];
+};
+
+#endif /* RVT_ABI_USER_H */