diff mbox

[V3,1/5] RDMA/core: Transport-independent access flags

Message ID 20150705232158.12029.25472.stgit@build2.ogc.int (mailing list archive)
State New, archived
Headers show

Commit Message

Steve Wise July 5, 2015, 11:22 p.m. UTC
The semantics for MR access flags are not consistent across RDMA
protocols.  So rather than have applications try and glean what they
need, have them pass in the intended roles and attributes for the MR to
be allocated and let the RDMA core select the appropriate access flags
given the roles, attributes, and device capabilities.

We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
possible roles and attributes for a MR.  These are documented in the
enums themselves.

New services exported:

rdma_device_access_flags() - given the intended MR roles and attributes
passed in, return the ib_access_flags bits for the device.

rdma_get_dma_mr() - allocate a dma mr using the applications intended
MR roles and MR attributes.  This uses rdma_device_access_flags().

rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
for a fast register WR given the applications intended MR roles and
MR attributes.  This uses rdma_device_access_flags().

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---

 drivers/infiniband/core/verbs.c |   30 +++++++++++
 include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 136 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Christoph Hellwig July 6, 2015, 5:25 a.m. UTC | #1
On Sun, Jul 05, 2015 at 06:22:00PM -0500, Steve Wise wrote:
> The semantics for MR access flags are not consistent across RDMA
> protocols.  So rather than have applications try and glean what they
> need, have them pass in the intended roles and attributes for the MR to
> be allocated and let the RDMA core select the appropriate access flags
> given the roles, attributes, and device capabilities.
> 
> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> possible roles and attributes for a MR.  These are documented in the
> enums themselves.
> 
> New services exported:
> 
> rdma_device_access_flags() - given the intended MR roles and attributes
> passed in, return the ib_access_flags bits for the device.
> 
> rdma_get_dma_mr() - allocate a dma mr using the applications intended
> MR roles and MR attributes.  This uses rdma_device_access_flags().
> 
> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> for a fast register WR given the applications intended MR roles and
> MR attributes.  This uses rdma_device_access_flags().
> 
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
> 
>  drivers/infiniband/core/verbs.c |   30 +++++++++++
>  include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 136 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index bac3fb4..f42595c 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
>  }
>  EXPORT_SYMBOL(ib_get_dma_mr);
>  
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> +{
> +	int access_flags = attrs;

Please add an assert for the values that are allowed for attrs.

It also would be highly useful to add a kerneldoc comment describing
the function and the parameters.  Also __bitwise sparse tricks
to ensure the right flags are passed wouldn't be a too bad idea.

> +/**
> + * rdma_mr_attributes - attributes for rdma memory regions
> + */
> +enum {
> +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> +};

Why not specfiy these as separate nuerical namespace?

> +/**
> + * rdma_device_access_flags - Returns the device-specific MR access flags.
> + * @pd: The protection domain associated with the memory region.
> + * @roles: The intended roles of the MR
> + * @attrs: The desired attributes of the MR
> + *
> + * Use the intended roles from @roles along with @attrs and the device
> + * capabilities to generate the needed access rights.
> + *
> + * Return: the ib_access_flags value needed to allocate the MR.
> + */
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);

Oh, here we have kerneldoc comments, just in the wrong place.  Please
move them to the function defintion.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Haggai Eran July 6, 2015, 7:09 a.m. UTC | #2
On 06/07/2015 02:22, Steve Wise wrote:
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> +{
> +	int access_flags = attrs;
> +
> +	if (roles & RDMA_MRR_RECV)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> +
> +	if (roles & RDMA_MRR_WRITE_DEST)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> +
> +	if (roles & RDMA_MRR_READ_DEST) {
> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> +		if (rdma_protocol_iwarp(pd->device,
> +					rdma_start_port(pd->device)))
> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> +	}
> +
> +	if (roles & RDMA_MRR_READ_SOURCE)
> +		access_flags |= IB_ACCESS_REMOTE_READ;
> +
> +	if (roles & RDMA_MRR_ATOMIC_DEST)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;

I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the
results of the atomic operation.

> +
> +	if (roles & RDMA_MRR_MW_BIND)
> +		access_flags |= IB_ACCESS_MW_BIND;
> +
> +	return access_flags;
> +}
> +EXPORT_SYMBOL(rdma_device_access_flags);

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 6, 2015, 7:53 a.m. UTC | #3
On 7/6/2015 2:22 AM, Steve Wise wrote:
> The semantics for MR access flags are not consistent across RDMA
> protocols.  So rather than have applications try and glean what they
> need, have them pass in the intended roles and attributes for the MR to
> be allocated and let the RDMA core select the appropriate access flags
> given the roles, attributes, and device capabilities.
>
> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> possible roles and attributes for a MR.  These are documented in the
> enums themselves.
>
> New services exported:
>
> rdma_device_access_flags() - given the intended MR roles and attributes
> passed in, return the ib_access_flags bits for the device.
>
> rdma_get_dma_mr() - allocate a dma mr using the applications intended
> MR roles and MR attributes.  This uses rdma_device_access_flags().
>
> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> for a fast register WR given the applications intended MR roles and
> MR attributes.  This uses rdma_device_access_flags().
>
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
>
>   drivers/infiniband/core/verbs.c |   30 +++++++++++
>   include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 136 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index bac3fb4..f42595c 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
>   }
>   EXPORT_SYMBOL(ib_get_dma_mr);
>
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> +{
> +	int access_flags = attrs;
> +
> +	if (roles & RDMA_MRR_RECV)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> +
> +	if (roles & RDMA_MRR_WRITE_DEST)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> +
> +	if (roles & RDMA_MRR_READ_DEST) {
> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> +		if (rdma_protocol_iwarp(pd->device,
> +					rdma_start_port(pd->device)))
> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> +	}
> +
> +	if (roles & RDMA_MRR_READ_SOURCE)
> +		access_flags |= IB_ACCESS_REMOTE_READ;
> +
> +	if (roles & RDMA_MRR_ATOMIC_DEST)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> +
> +	if (roles & RDMA_MRR_MW_BIND)
> +		access_flags |= IB_ACCESS_MW_BIND;
> +
> +	return access_flags;
> +}
> +EXPORT_SYMBOL(rdma_device_access_flags);
> +
>   struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
>   			     struct ib_phys_buf *phys_buf_array,
>   			     int num_phys_buf,
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 986fddb..da1eadf 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
>   struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
>
>   /**
> + * rdma_mr_roles - possible roles an RDMA MR will be used for
> + *
> + * This allows a transport independent RDMA application to
> + * create MRs that are usable for all the desired roles w/o
> + * having to understand which access rights are needed.
> + */
> +enum {
> +
> +	/* lkey used in a ib_recv_wr sge */
> +	RDMA_MRR_RECV			= 1,
> +
> +	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
> +	RDMA_MRR_SEND			= (1<<1),
> +
> +	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
> +	RDMA_MRR_READ_SOURCE		= (1<<2),
> +
> +	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
> +	RDMA_MRR_READ_DEST		= (1<<3),
> +
> +	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
> +	RDMA_MRR_WRITE_SOURCE		= (1<<4),
> +
> +	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
> +	RDMA_MRR_WRITE_DEST		= (1<<5),
> +
> +	/*
> +	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
> +	 */
> +	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
> +
> +	/*
> +	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
> +	 * wr.atomic.rkey
> +	 */
> +	RDMA_MRR_ATOMIC_DEST		= (1<<7),
> +
> +	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
> +	RDMA_MRR_MW_BIND		= (1<<8),
> +};
> +
> +/**
> + * rdma_mr_attributes - attributes for rdma memory regions
> + */
> +enum {
> +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> +};
> +
> +/**
> + * rdma_device_access_flags - Returns the device-specific MR access flags.
> + * @pd: The protection domain associated with the memory region.
> + * @roles: The intended roles of the MR
> + * @attrs: The desired attributes of the MR
> + *
> + * Use the intended roles from @roles along with @attrs and the device
> + * capabilities to generate the needed access rights.
> + *
> + * Return: the ib_access_flags value needed to allocate the MR.
> + */
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
> +
> +/**
> + * rdma_get_dma_mr - Returns a memory region for system memory that is
> + * usable for DMA.
> + * @pd: The protection domain associated with the memory region.
> + * @roles: The intended roles of the MR
> + * @attrs: The desired attributes of the MR
> + *
> + * Use the intended roles from @roles along with @attrs and the device
> + * capabilities to define the needed access rights, and call
> + * ib_get_dma_mr() to allocate the MR.
> + *
> + * Note that the ib_dma_*() functions defined below must be used
> + * to create/destroy addresses used with the Lkey or Rkey returned
> + * by ib_get_dma_mr().
> + *
> + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
> + * failure.
> + */
> +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
> +					    int attrs)
> +{
> +	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
> +}
> +

I still think this wrapper should go away...
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 6, 2015, 7:58 a.m. UTC | #4
On 7/6/2015 2:22 AM, Steve Wise wrote:
> The semantics for MR access flags are not consistent across RDMA
> protocols.  So rather than have applications try and glean what they
> need, have them pass in the intended roles and attributes for the MR to
> be allocated and let the RDMA core select the appropriate access flags
> given the roles, attributes, and device capabilities.
>
> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> possible roles and attributes for a MR.  These are documented in the
> enums themselves.
>
> New services exported:
>
> rdma_device_access_flags() - given the intended MR roles and attributes
> passed in, return the ib_access_flags bits for the device.
>
> rdma_get_dma_mr() - allocate a dma mr using the applications intended
> MR roles and MR attributes.  This uses rdma_device_access_flags().
>
> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> for a fast register WR given the applications intended MR roles and
> MR attributes.  This uses rdma_device_access_flags().
>
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
>
>   drivers/infiniband/core/verbs.c |   30 +++++++++++
>   include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 136 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index bac3fb4..f42595c 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
>   }
>   EXPORT_SYMBOL(ib_get_dma_mr);
>
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> +{
> +	int access_flags = attrs;
> +
> +	if (roles & RDMA_MRR_RECV)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> +
> +	if (roles & RDMA_MRR_WRITE_DEST)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> +
> +	if (roles & RDMA_MRR_READ_DEST) {
> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> +		if (rdma_protocol_iwarp(pd->device,
> +					rdma_start_port(pd->device)))
> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> +	}
> +
> +	if (roles & RDMA_MRR_READ_SOURCE)
> +		access_flags |= IB_ACCESS_REMOTE_READ;
> +
> +	if (roles & RDMA_MRR_ATOMIC_DEST)
> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> +
> +	if (roles & RDMA_MRR_MW_BIND)
> +		access_flags |= IB_ACCESS_MW_BIND;
> +
> +	return access_flags;
> +}
> +EXPORT_SYMBOL(rdma_device_access_flags);
> +
>   struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
>   			     struct ib_phys_buf *phys_buf_array,
>   			     int num_phys_buf,
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 986fddb..da1eadf 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
>   struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
>
>   /**
> + * rdma_mr_roles - possible roles an RDMA MR will be used for
> + *
> + * This allows a transport independent RDMA application to
> + * create MRs that are usable for all the desired roles w/o
> + * having to understand which access rights are needed.
> + */
> +enum {
> +
> +	/* lkey used in a ib_recv_wr sge */
> +	RDMA_MRR_RECV			= 1,
> +
> +	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
> +	RDMA_MRR_SEND			= (1<<1),
> +
> +	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
> +	RDMA_MRR_READ_SOURCE		= (1<<2),
> +
> +	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
> +	RDMA_MRR_READ_DEST		= (1<<3),
> +
> +	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
> +	RDMA_MRR_WRITE_SOURCE		= (1<<4),
> +
> +	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
> +	RDMA_MRR_WRITE_DEST		= (1<<5),
> +
> +	/*
> +	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
> +	 */
> +	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
> +
> +	/*
> +	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
> +	 * wr.atomic.rkey
> +	 */
> +	RDMA_MRR_ATOMIC_DEST		= (1<<7),
> +
> +	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
> +	RDMA_MRR_MW_BIND		= (1<<8),
> +};
> +
> +/**
> + * rdma_mr_attributes - attributes for rdma memory regions
> + */
> +enum {
> +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> +};
> +
> +/**
> + * rdma_device_access_flags - Returns the device-specific MR access flags.
> + * @pd: The protection domain associated with the memory region.
> + * @roles: The intended roles of the MR
> + * @attrs: The desired attributes of the MR
> + *
> + * Use the intended roles from @roles along with @attrs and the device
> + * capabilities to generate the needed access rights.
> + *
> + * Return: the ib_access_flags value needed to allocate the MR.
> + */
> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
> +
> +/**
> + * rdma_get_dma_mr - Returns a memory region for system memory that is
> + * usable for DMA.
> + * @pd: The protection domain associated with the memory region.
> + * @roles: The intended roles of the MR
> + * @attrs: The desired attributes of the MR
> + *
> + * Use the intended roles from @roles along with @attrs and the device
> + * capabilities to define the needed access rights, and call
> + * ib_get_dma_mr() to allocate the MR.
> + *
> + * Note that the ib_dma_*() functions defined below must be used
> + * to create/destroy addresses used with the Lkey or Rkey returned
> + * by ib_get_dma_mr().
> + *
> + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
> + * failure.
> + */
> +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
> +					    int attrs)
> +{
> +	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
> +}
> +
> +/**
> + * rdma_fast_reg_access_flags - Returns the access flags needed for a fast
> + * register operation.
> + * @pd: The protection domain associated with the memory region.
> + * @roles: The intended roles of the MR
> + * @attrs: The desired attributes of the MR
> + *
> + * Use the intended roles from @roles along with @attrs and the device
> + * capabilities to define the needed access rights for a fast registration
> + * work request.
> + *
> + * Return: the ib_access_flags value needed for fast registration.
> + */
> +static inline int rdma_fast_reg_access_flags(struct ib_pd *pd, int roles,
> +					     int attrs)
> +{
> +	return rdma_device_access_flags(pd, roles, attrs);
> +}

Why do we have rdma_fast_reg_access_flags() that just trampolines to
rdma_device_access_flags()? why do we need it?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 6, 2015, 2:23 p.m. UTC | #5
> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@infradead.org]
> Sent: Monday, July 06, 2015 12:25 AM
> To: Steve Wise
> Cc: dledford@redhat.com; sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org;
> eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On Sun, Jul 05, 2015 at 06:22:00PM -0500, Steve Wise wrote:
> > The semantics for MR access flags are not consistent across RDMA
> > protocols.  So rather than have applications try and glean what they
> > need, have them pass in the intended roles and attributes for the MR to
> > be allocated and let the RDMA core select the appropriate access flags
> > given the roles, attributes, and device capabilities.
> >
> > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> > possible roles and attributes for a MR.  These are documented in the
> > enums themselves.
> >
> > New services exported:
> >
> > rdma_device_access_flags() - given the intended MR roles and attributes
> > passed in, return the ib_access_flags bits for the device.
> >
> > rdma_get_dma_mr() - allocate a dma mr using the applications intended
> > MR roles and MR attributes.  This uses rdma_device_access_flags().
> >
> > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> > for a fast register WR given the applications intended MR roles and
> > MR attributes.  This uses rdma_device_access_flags().
> >
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> > ---
> >
> >  drivers/infiniband/core/verbs.c |   30 +++++++++++
> >  include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 136 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index bac3fb4..f42595c 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
> >  }
> >  EXPORT_SYMBOL(ib_get_dma_mr);
> >
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> > +{
> > +	int access_flags = attrs;
> 
> Please add an assert for the values that are allowed for attrs.
> 
> It also would be highly useful to add a kerneldoc comment describing
> the function and the parameters.  Also __bitwise sparse tricks
> to ensure the right flags are passed wouldn't be a too bad idea.
> 

Can you explain the "__bitwise sparse tricks"?  Or point me to an example.  

> > +/**
> > + * rdma_mr_attributes - attributes for rdma memory regions
> > + */
> > +enum {
> > +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> > +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> > +};
> 
> Why not specfiy these as separate nuerical namespace?
> 

To avoid having to translate them. 

> > +/**
> > + * rdma_device_access_flags - Returns the device-specific MR access flags.
> > + * @pd: The protection domain associated with the memory region.
> > + * @roles: The intended roles of the MR
> > + * @attrs: The desired attributes of the MR
> > + *
> > + * Use the intended roles from @roles along with @attrs and the device
> > + * capabilities to generate the needed access rights.
> > + *
> > + * Return: the ib_access_flags value needed to allocate the MR.
> > + */
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
> 
> Oh, here we have kerneldoc comments, just in the wrong place.  Please
> move them to the function defintion.

Ok.


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 6, 2015, 2:29 p.m. UTC | #6
> -----Original Message-----
> From: Haggai Eran [mailto:haggaie@mellanox.com]
> Sent: Monday, July 06, 2015 2:09 AM
> To: Steve Wise; dledford@redhat.com
> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 06/07/2015 02:22, Steve Wise wrote:
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> > +{
> > +	int access_flags = attrs;
> > +
> > +	if (roles & RDMA_MRR_RECV)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +
> > +	if (roles & RDMA_MRR_WRITE_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> > +
> > +	if (roles & RDMA_MRR_READ_DEST) {
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +		if (rdma_protocol_iwarp(pd->device,
> > +					rdma_start_port(pd->device)))
> > +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> > +	}
> > +
> > +	if (roles & RDMA_MRR_READ_SOURCE)
> > +		access_flags |= IB_ACCESS_REMOTE_READ;
> > +
> > +	if (roles & RDMA_MRR_ATOMIC_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> 
> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the
> results of the atomic operation.
> 

Ok.  I'm not familiar with atomics.  I'll verify with the IB spec and update the code as needed.


> > +
> > +	if (roles & RDMA_MRR_MW_BIND)
> > +		access_flags |= IB_ACCESS_MW_BIND;
> > +
> > +	return access_flags;
> > +}
> > +EXPORT_SYMBOL(rdma_device_access_flags);

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 6, 2015, 2:37 p.m. UTC | #7
> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Monday, July 06, 2015 2:54 AM
> To: Steve Wise; dledford@redhat.com
> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 7/6/2015 2:22 AM, Steve Wise wrote:
> > The semantics for MR access flags are not consistent across RDMA
> > protocols.  So rather than have applications try and glean what they
> > need, have them pass in the intended roles and attributes for the MR to
> > be allocated and let the RDMA core select the appropriate access flags
> > given the roles, attributes, and device capabilities.
> >
> > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> > possible roles and attributes for a MR.  These are documented in the
> > enums themselves.
> >
> > New services exported:
> >
> > rdma_device_access_flags() - given the intended MR roles and attributes
> > passed in, return the ib_access_flags bits for the device.
> >
> > rdma_get_dma_mr() - allocate a dma mr using the applications intended
> > MR roles and MR attributes.  This uses rdma_device_access_flags().
> >
> > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> > for a fast register WR given the applications intended MR roles and
> > MR attributes.  This uses rdma_device_access_flags().
> >
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> > ---
> >
> >   drivers/infiniband/core/verbs.c |   30 +++++++++++
> >   include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 136 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index bac3fb4..f42595c 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
> >   }
> >   EXPORT_SYMBOL(ib_get_dma_mr);
> >
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> > +{
> > +	int access_flags = attrs;
> > +
> > +	if (roles & RDMA_MRR_RECV)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +
> > +	if (roles & RDMA_MRR_WRITE_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> > +
> > +	if (roles & RDMA_MRR_READ_DEST) {
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +		if (rdma_protocol_iwarp(pd->device,
> > +					rdma_start_port(pd->device)))
> > +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> > +	}
> > +
> > +	if (roles & RDMA_MRR_READ_SOURCE)
> > +		access_flags |= IB_ACCESS_REMOTE_READ;
> > +
> > +	if (roles & RDMA_MRR_ATOMIC_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> > +
> > +	if (roles & RDMA_MRR_MW_BIND)
> > +		access_flags |= IB_ACCESS_MW_BIND;
> > +
> > +	return access_flags;
> > +}
> > +EXPORT_SYMBOL(rdma_device_access_flags);
> > +
> >   struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
> >   			     struct ib_phys_buf *phys_buf_array,
> >   			     int num_phys_buf,
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > index 986fddb..da1eadf 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
> >   struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
> >
> >   /**
> > + * rdma_mr_roles - possible roles an RDMA MR will be used for
> > + *
> > + * This allows a transport independent RDMA application to
> > + * create MRs that are usable for all the desired roles w/o
> > + * having to understand which access rights are needed.
> > + */
> > +enum {
> > +
> > +	/* lkey used in a ib_recv_wr sge */
> > +	RDMA_MRR_RECV			= 1,
> > +
> > +	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
> > +	RDMA_MRR_SEND			= (1<<1),
> > +
> > +	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
> > +	RDMA_MRR_READ_SOURCE		= (1<<2),
> > +
> > +	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
> > +	RDMA_MRR_READ_DEST		= (1<<3),
> > +
> > +	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
> > +	RDMA_MRR_WRITE_SOURCE		= (1<<4),
> > +
> > +	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
> > +	RDMA_MRR_WRITE_DEST		= (1<<5),
> > +
> > +	/*
> > +	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
> > +	 */
> > +	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
> > +
> > +	/*
> > +	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
> > +	 * wr.atomic.rkey
> > +	 */
> > +	RDMA_MRR_ATOMIC_DEST		= (1<<7),
> > +
> > +	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
> > +	RDMA_MRR_MW_BIND		= (1<<8),
> > +};
> > +
> > +/**
> > + * rdma_mr_attributes - attributes for rdma memory regions
> > + */
> > +enum {
> > +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> > +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> > +};
> > +
> > +/**
> > + * rdma_device_access_flags - Returns the device-specific MR access flags.
> > + * @pd: The protection domain associated with the memory region.
> > + * @roles: The intended roles of the MR
> > + * @attrs: The desired attributes of the MR
> > + *
> > + * Use the intended roles from @roles along with @attrs and the device
> > + * capabilities to generate the needed access rights.
> > + *
> > + * Return: the ib_access_flags value needed to allocate the MR.
> > + */
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
> > +
> > +/**
> > + * rdma_get_dma_mr - Returns a memory region for system memory that is
> > + * usable for DMA.
> > + * @pd: The protection domain associated with the memory region.
> > + * @roles: The intended roles of the MR
> > + * @attrs: The desired attributes of the MR
> > + *
> > + * Use the intended roles from @roles along with @attrs and the device
> > + * capabilities to define the needed access rights, and call
> > + * ib_get_dma_mr() to allocate the MR.
> > + *
> > + * Note that the ib_dma_*() functions defined below must be used
> > + * to create/destroy addresses used with the Lkey or Rkey returned
> > + * by ib_get_dma_mr().
> > + *
> > + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
> > + * failure.
> > + */
> > +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
> > +					    int attrs)
> > +{
> > +	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
> > +}
> > +
> 
> I still think this wrapper should go away...

Ok. I'll remove all uses of ib_get_dma_mr()...


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 6, 2015, 2:39 p.m. UTC | #8
> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> Sent: Monday, July 06, 2015 2:59 AM
> To: Steve Wise; dledford@redhat.com
> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 7/6/2015 2:22 AM, Steve Wise wrote:
> > The semantics for MR access flags are not consistent across RDMA
> > protocols.  So rather than have applications try and glean what they
> > need, have them pass in the intended roles and attributes for the MR to
> > be allocated and let the RDMA core select the appropriate access flags
> > given the roles, attributes, and device capabilities.
> >
> > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> > possible roles and attributes for a MR.  These are documented in the
> > enums themselves.
> >
> > New services exported:
> >
> > rdma_device_access_flags() - given the intended MR roles and attributes
> > passed in, return the ib_access_flags bits for the device.
> >
> > rdma_get_dma_mr() - allocate a dma mr using the applications intended
> > MR roles and MR attributes.  This uses rdma_device_access_flags().
> >
> > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> > for a fast register WR given the applications intended MR roles and
> > MR attributes.  This uses rdma_device_access_flags().
> >
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> > ---
> >
> >   drivers/infiniband/core/verbs.c |   30 +++++++++++
> >   include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 136 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index bac3fb4..f42595c 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
> >   }
> >   EXPORT_SYMBOL(ib_get_dma_mr);
> >
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> > +{
> > +	int access_flags = attrs;
> > +
> > +	if (roles & RDMA_MRR_RECV)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +
> > +	if (roles & RDMA_MRR_WRITE_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> > +
> > +	if (roles & RDMA_MRR_READ_DEST) {
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +		if (rdma_protocol_iwarp(pd->device,
> > +					rdma_start_port(pd->device)))
> > +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> > +	}
> > +
> > +	if (roles & RDMA_MRR_READ_SOURCE)
> > +		access_flags |= IB_ACCESS_REMOTE_READ;
> > +
> > +	if (roles & RDMA_MRR_ATOMIC_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> > +
> > +	if (roles & RDMA_MRR_MW_BIND)
> > +		access_flags |= IB_ACCESS_MW_BIND;
> > +
> > +	return access_flags;
> > +}
> > +EXPORT_SYMBOL(rdma_device_access_flags);
> > +
> >   struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
> >   			     struct ib_phys_buf *phys_buf_array,
> >   			     int num_phys_buf,
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > index 986fddb..da1eadf 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
> >   struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
> >
> >   /**
> > + * rdma_mr_roles - possible roles an RDMA MR will be used for
> > + *
> > + * This allows a transport independent RDMA application to
> > + * create MRs that are usable for all the desired roles w/o
> > + * having to understand which access rights are needed.
> > + */
> > +enum {
> > +
> > +	/* lkey used in a ib_recv_wr sge */
> > +	RDMA_MRR_RECV			= 1,
> > +
> > +	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
> > +	RDMA_MRR_SEND			= (1<<1),
> > +
> > +	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
> > +	RDMA_MRR_READ_SOURCE		= (1<<2),
> > +
> > +	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
> > +	RDMA_MRR_READ_DEST		= (1<<3),
> > +
> > +	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
> > +	RDMA_MRR_WRITE_SOURCE		= (1<<4),
> > +
> > +	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
> > +	RDMA_MRR_WRITE_DEST		= (1<<5),
> > +
> > +	/*
> > +	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
> > +	 */
> > +	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
> > +
> > +	/*
> > +	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
> > +	 * wr.atomic.rkey
> > +	 */
> > +	RDMA_MRR_ATOMIC_DEST		= (1<<7),
> > +
> > +	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
> > +	RDMA_MRR_MW_BIND		= (1<<8),
> > +};
> > +
> > +/**
> > + * rdma_mr_attributes - attributes for rdma memory regions
> > + */
> > +enum {
> > +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> > +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> > +};
> > +
> > +/**
> > + * rdma_device_access_flags - Returns the device-specific MR access flags.
> > + * @pd: The protection domain associated with the memory region.
> > + * @roles: The intended roles of the MR
> > + * @attrs: The desired attributes of the MR
> > + *
> > + * Use the intended roles from @roles along with @attrs and the device
> > + * capabilities to generate the needed access rights.
> > + *
> > + * Return: the ib_access_flags value needed to allocate the MR.
> > + */
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
> > +
> > +/**
> > + * rdma_get_dma_mr - Returns a memory region for system memory that is
> > + * usable for DMA.
> > + * @pd: The protection domain associated with the memory region.
> > + * @roles: The intended roles of the MR
> > + * @attrs: The desired attributes of the MR
> > + *
> > + * Use the intended roles from @roles along with @attrs and the device
> > + * capabilities to define the needed access rights, and call
> > + * ib_get_dma_mr() to allocate the MR.
> > + *
> > + * Note that the ib_dma_*() functions defined below must be used
> > + * to create/destroy addresses used with the Lkey or Rkey returned
> > + * by ib_get_dma_mr().
> > + *
> > + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
> > + * failure.
> > + */
> > +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
> > +					    int attrs)
> > +{
> > +	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
> > +}
> > +
> > +/**
> > + * rdma_fast_reg_access_flags - Returns the access flags needed for a fast
> > + * register operation.
> > + * @pd: The protection domain associated with the memory region.
> > + * @roles: The intended roles of the MR
> > + * @attrs: The desired attributes of the MR
> > + *
> > + * Use the intended roles from @roles along with @attrs and the device
> > + * capabilities to define the needed access rights for a fast registration
> > + * work request.
> > + *
> > + * Return: the ib_access_flags value needed for fast registration.
> > + */
> > +static inline int rdma_fast_reg_access_flags(struct ib_pd *pd, int roles,
> > +					     int attrs)
> > +{
> > +	return rdma_device_access_flags(pd, roles, attrs);
> > +}
> 
> Why do we have rdma_fast_reg_access_flags() that just trampolines to
> rdma_device_access_flags()? why do we need it?

My initial thought was that fast_reg access flags might be different that general MR access flags for some devices.  But since that isn't the cases, I'll remove it.  


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 6, 2015, 4:17 p.m. UTC | #9
On 7/6/2015 5:37 PM, Steve Wise wrote:
>
>
>> -----Original Message-----
>> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
>> Sent: Monday, July 06, 2015 2:54 AM
>> To: Steve Wise; dledford@redhat.com
>> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
>> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
>> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
>>
>> On 7/6/2015 2:22 AM, Steve Wise wrote:
>>> The semantics for MR access flags are not consistent across RDMA
>>> protocols.  So rather than have applications try and glean what they
>>> need, have them pass in the intended roles and attributes for the MR to
>>> be allocated and let the RDMA core select the appropriate access flags
>>> given the roles, attributes, and device capabilities.
>>>
>>> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
>>> possible roles and attributes for a MR.  These are documented in the
>>> enums themselves.
>>>
>>> New services exported:
>>>
>>> rdma_device_access_flags() - given the intended MR roles and attributes
>>> passed in, return the ib_access_flags bits for the device.
>>>
>>> rdma_get_dma_mr() - allocate a dma mr using the applications intended
>>> MR roles and MR attributes.  This uses rdma_device_access_flags().
>>>
>>> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
>>> for a fast register WR given the applications intended MR roles and
>>> MR attributes.  This uses rdma_device_access_flags().
>>>
>>> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
>>> ---
>>>
>>>    drivers/infiniband/core/verbs.c |   30 +++++++++++
>>>    include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
>>>    2 files changed, 136 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>>> index bac3fb4..f42595c 100644
>>> --- a/drivers/infiniband/core/verbs.c
>>> +++ b/drivers/infiniband/core/verbs.c
>>> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
>>>    }
>>>    EXPORT_SYMBOL(ib_get_dma_mr);
>>>
>>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
>>> +{
>>> +	int access_flags = attrs;
>>> +
>>> +	if (roles & RDMA_MRR_RECV)
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
>>> +
>>> +	if (roles & RDMA_MRR_WRITE_DEST)
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
>>> +
>>> +	if (roles & RDMA_MRR_READ_DEST) {
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
>>> +		if (rdma_protocol_iwarp(pd->device,
>>> +					rdma_start_port(pd->device)))
>>> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
>>> +	}
>>> +
>>> +	if (roles & RDMA_MRR_READ_SOURCE)
>>> +		access_flags |= IB_ACCESS_REMOTE_READ;
>>> +
>>> +	if (roles & RDMA_MRR_ATOMIC_DEST)
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
>>> +
>>> +	if (roles & RDMA_MRR_MW_BIND)
>>> +		access_flags |= IB_ACCESS_MW_BIND;
>>> +
>>> +	return access_flags;
>>> +}
>>> +EXPORT_SYMBOL(rdma_device_access_flags);
>>> +
>>>    struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
>>>    			     struct ib_phys_buf *phys_buf_array,
>>>    			     int num_phys_buf,
>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>> index 986fddb..da1eadf 100644
>>> --- a/include/rdma/ib_verbs.h
>>> +++ b/include/rdma/ib_verbs.h
>>> @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
>>>    struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
>>>
>>>    /**
>>> + * rdma_mr_roles - possible roles an RDMA MR will be used for
>>> + *
>>> + * This allows a transport independent RDMA application to
>>> + * create MRs that are usable for all the desired roles w/o
>>> + * having to understand which access rights are needed.
>>> + */
>>> +enum {
>>> +
>>> +	/* lkey used in a ib_recv_wr sge */
>>> +	RDMA_MRR_RECV			= 1,
>>> +
>>> +	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
>>> +	RDMA_MRR_SEND			= (1<<1),
>>> +
>>> +	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
>>> +	RDMA_MRR_READ_SOURCE		= (1<<2),
>>> +
>>> +	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
>>> +	RDMA_MRR_READ_DEST		= (1<<3),
>>> +
>>> +	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
>>> +	RDMA_MRR_WRITE_SOURCE		= (1<<4),
>>> +
>>> +	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
>>> +	RDMA_MRR_WRITE_DEST		= (1<<5),
>>> +
>>> +	/*
>>> +	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
>>> +	 */
>>> +	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
>>> +
>>> +	/*
>>> +	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
>>> +	 * wr.atomic.rkey
>>> +	 */
>>> +	RDMA_MRR_ATOMIC_DEST		= (1<<7),
>>> +
>>> +	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
>>> +	RDMA_MRR_MW_BIND		= (1<<8),
>>> +};
>>> +
>>> +/**
>>> + * rdma_mr_attributes - attributes for rdma memory regions
>>> + */
>>> +enum {
>>> +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
>>> +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
>>> +};
>>> +
>>> +/**
>>> + * rdma_device_access_flags - Returns the device-specific MR access flags.
>>> + * @pd: The protection domain associated with the memory region.
>>> + * @roles: The intended roles of the MR
>>> + * @attrs: The desired attributes of the MR
>>> + *
>>> + * Use the intended roles from @roles along with @attrs and the device
>>> + * capabilities to generate the needed access rights.
>>> + *
>>> + * Return: the ib_access_flags value needed to allocate the MR.
>>> + */
>>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
>>> +
>>> +/**
>>> + * rdma_get_dma_mr - Returns a memory region for system memory that is
>>> + * usable for DMA.
>>> + * @pd: The protection domain associated with the memory region.
>>> + * @roles: The intended roles of the MR
>>> + * @attrs: The desired attributes of the MR
>>> + *
>>> + * Use the intended roles from @roles along with @attrs and the device
>>> + * capabilities to define the needed access rights, and call
>>> + * ib_get_dma_mr() to allocate the MR.
>>> + *
>>> + * Note that the ib_dma_*() functions defined below must be used
>>> + * to create/destroy addresses used with the Lkey or Rkey returned
>>> + * by ib_get_dma_mr().
>>> + *
>>> + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
>>> + * failure.
>>> + */
>>> +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
>>> +					    int attrs)
>>> +{
>>> +	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
>>> +}
>>> +
>>
>> I still think this wrapper should go away...
>
> Ok. I'll remove all uses of ib_get_dma_mr()...
>
>

I meant that rdma_get_dma_mr can go away. I'd prefer to get the
needed access_flags and just call existing verb.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 6, 2015, 4:55 p.m. UTC | #10
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Sagi Grimberg
> Sent: Monday, July 06, 2015 11:18 AM
> To: Steve Wise; dledford@redhat.com
> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 7/6/2015 5:37 PM, Steve Wise wrote:
> >
> >
> >> -----Original Message-----
> >> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il]
> >> Sent: Monday, July 06, 2015 2:54 AM
> >> To: Steve Wise; dledford@redhat.com
> >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> >>
> >> On 7/6/2015 2:22 AM, Steve Wise wrote:
> >>> The semantics for MR access flags are not consistent across RDMA
> >>> protocols.  So rather than have applications try and glean what they
> >>> need, have them pass in the intended roles and attributes for the MR to
> >>> be allocated and let the RDMA core select the appropriate access flags
> >>> given the roles, attributes, and device capabilities.
> >>>
> >>> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the
> >>> possible roles and attributes for a MR.  These are documented in the
> >>> enums themselves.
> >>>
> >>> New services exported:
> >>>
> >>> rdma_device_access_flags() - given the intended MR roles and attributes
> >>> passed in, return the ib_access_flags bits for the device.
> >>>
> >>> rdma_get_dma_mr() - allocate a dma mr using the applications intended
> >>> MR roles and MR attributes.  This uses rdma_device_access_flags().
> >>>
> >>> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed
> >>> for a fast register WR given the applications intended MR roles and
> >>> MR attributes.  This uses rdma_device_access_flags().
> >>>
> >>> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> >>> ---
> >>>
> >>>    drivers/infiniband/core/verbs.c |   30 +++++++++++
> >>>    include/rdma/ib_verbs.h         |  106 +++++++++++++++++++++++++++++++++++++++
> >>>    2 files changed, 136 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> >>> index bac3fb4..f42595c 100644
> >>> --- a/drivers/infiniband/core/verbs.c
> >>> +++ b/drivers/infiniband/core/verbs.c
> >>> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
> >>>    }
> >>>    EXPORT_SYMBOL(ib_get_dma_mr);
> >>>
> >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> >>> +{
> >>> +	int access_flags = attrs;
> >>> +
> >>> +	if (roles & RDMA_MRR_RECV)
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> >>> +
> >>> +	if (roles & RDMA_MRR_WRITE_DEST)
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> >>> +
> >>> +	if (roles & RDMA_MRR_READ_DEST) {
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> >>> +		if (rdma_protocol_iwarp(pd->device,
> >>> +					rdma_start_port(pd->device)))
> >>> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> >>> +	}
> >>> +
> >>> +	if (roles & RDMA_MRR_READ_SOURCE)
> >>> +		access_flags |= IB_ACCESS_REMOTE_READ;
> >>> +
> >>> +	if (roles & RDMA_MRR_ATOMIC_DEST)
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> >>> +
> >>> +	if (roles & RDMA_MRR_MW_BIND)
> >>> +		access_flags |= IB_ACCESS_MW_BIND;
> >>> +
> >>> +	return access_flags;
> >>> +}
> >>> +EXPORT_SYMBOL(rdma_device_access_flags);
> >>> +
> >>>    struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
> >>>    			     struct ib_phys_buf *phys_buf_array,
> >>>    			     int num_phys_buf,
> >>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> >>> index 986fddb..da1eadf 100644
> >>> --- a/include/rdma/ib_verbs.h
> >>> +++ b/include/rdma/ib_verbs.h
> >>> @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
> >>>    struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
> >>>
> >>>    /**
> >>> + * rdma_mr_roles - possible roles an RDMA MR will be used for
> >>> + *
> >>> + * This allows a transport independent RDMA application to
> >>> + * create MRs that are usable for all the desired roles w/o
> >>> + * having to understand which access rights are needed.
> >>> + */
> >>> +enum {
> >>> +
> >>> +	/* lkey used in a ib_recv_wr sge */
> >>> +	RDMA_MRR_RECV			= 1,
> >>> +
> >>> +	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
> >>> +	RDMA_MRR_SEND			= (1<<1),
> >>> +
> >>> +	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
> >>> +	RDMA_MRR_READ_SOURCE		= (1<<2),
> >>> +
> >>> +	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
> >>> +	RDMA_MRR_READ_DEST		= (1<<3),
> >>> +
> >>> +	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
> >>> +	RDMA_MRR_WRITE_SOURCE		= (1<<4),
> >>> +
> >>> +	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
> >>> +	RDMA_MRR_WRITE_DEST		= (1<<5),
> >>> +
> >>> +	/*
> >>> +	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
> >>> +	 */
> >>> +	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
> >>> +
> >>> +	/*
> >>> +	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
> >>> +	 * wr.atomic.rkey
> >>> +	 */
> >>> +	RDMA_MRR_ATOMIC_DEST		= (1<<7),
> >>> +
> >>> +	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
> >>> +	RDMA_MRR_MW_BIND		= (1<<8),
> >>> +};
> >>> +
> >>> +/**
> >>> + * rdma_mr_attributes - attributes for rdma memory regions
> >>> + */
> >>> +enum {
> >>> +	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
> >>> +	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
> >>> +};
> >>> +
> >>> +/**
> >>> + * rdma_device_access_flags - Returns the device-specific MR access flags.
> >>> + * @pd: The protection domain associated with the memory region.
> >>> + * @roles: The intended roles of the MR
> >>> + * @attrs: The desired attributes of the MR
> >>> + *
> >>> + * Use the intended roles from @roles along with @attrs and the device
> >>> + * capabilities to generate the needed access rights.
> >>> + *
> >>> + * Return: the ib_access_flags value needed to allocate the MR.
> >>> + */
> >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
> >>> +
> >>> +/**
> >>> + * rdma_get_dma_mr - Returns a memory region for system memory that is
> >>> + * usable for DMA.
> >>> + * @pd: The protection domain associated with the memory region.
> >>> + * @roles: The intended roles of the MR
> >>> + * @attrs: The desired attributes of the MR
> >>> + *
> >>> + * Use the intended roles from @roles along with @attrs and the device
> >>> + * capabilities to define the needed access rights, and call
> >>> + * ib_get_dma_mr() to allocate the MR.
> >>> + *
> >>> + * Note that the ib_dma_*() functions defined below must be used
> >>> + * to create/destroy addresses used with the Lkey or Rkey returned
> >>> + * by ib_get_dma_mr().
> >>> + *
> >>> + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
> >>> + * failure.
> >>> + */
> >>> +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
> >>> +					    int attrs)
> >>> +{
> >>> +	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
> >>> +}
> >>> +
> >>
> >> I still think this wrapper should go away...
> >
> > Ok. I'll remove all uses of ib_get_dma_mr()...
> >
> >
> 
> I meant that rdma_get_dma_mr can go away. I'd prefer to get the
> needed access_flags and just call existing verb.
> 

So just export rdma_device_access_flags(rd, roles, attrs), and then change the ULPs to call this to obtain the access flags.  That is ok with me.  What do others think?

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 7, 2015, 8:58 a.m. UTC | #11
On Mon, Jul 06, 2015 at 09:23:42AM -0500, Steve Wise wrote:
> > Please add an assert for the values that are allowed for attrs.
> > 
> > It also would be highly useful to add a kerneldoc comment describing
> > the function and the parameters.  Also __bitwise sparse tricks
> > to ensure the right flags are passed wouldn't be a too bad idea.
> > 
> 
> Can you explain the "__bitwise sparse tricks"?  Or point me to an example.  

Grep the kernel tree for __bitwise.  It allows creating a type with
additional type annotations so that sparse will warn when assigning
incorrect values to it.

> > Why not specfiy these as separate nuerical namespace?
> > 
> 
> To avoid having to translate them. 

But they are different values, so you should translate them.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 7, 2015, 9 a.m. UTC | #12
On Mon, Jul 06, 2015 at 07:17:38PM +0300, Sagi Grimberg wrote:
> >Ok. I'll remove all uses of ib_get_dma_mr()...
> >
> >
> 
> I meant that rdma_get_dma_mr can go away. I'd prefer to get the
> needed access_flags and just call existing verb.

I strongly disagree.  As this series has shown the existing API is not
epressive enough for all transports.  It thus needs to go away and be
fully replaced by the new API introduced here.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 7, 2015, 9:14 a.m. UTC | #13
On 7/7/2015 12:00 PM, Christoph Hellwig wrote:
> On Mon, Jul 06, 2015 at 07:17:38PM +0300, Sagi Grimberg wrote:
>>> Ok. I'll remove all uses of ib_get_dma_mr()...
>>>
>>>
>>
>> I meant that rdma_get_dma_mr can go away. I'd prefer to get the
>> needed access_flags and just call existing verb.
>
> I strongly disagree.  As this series has shown the existing API is not
> epressive enough for all transports.  It thus needs to go away and be
> fully replaced by the new API introduced here.
>

Christoph,

I wasn't arguing about having a transport independent API. I was
referring to this wrapper specifically that trampolines to
ib_get_dma_mr() with rdma_device_access_flags(pd, roles, attrs) helper.

The rdma_device_access_flags() itself is fine. However, given that
this helper is used elsewhere as well, I don't see the point of having
yet another helper specifically just for the dma_mr case that does
nothing more than trampolines with a call to rdma_device_access_flags().

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 7, 2015, 2:05 p.m. UTC | #14
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Sagi Grimberg
> Sent: Tuesday, July 07, 2015 4:15 AM
> To: Christoph Hellwig
> Cc: Steve Wise; dledford@redhat.com; sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org;
> eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 7/7/2015 12:00 PM, Christoph Hellwig wrote:
> > On Mon, Jul 06, 2015 at 07:17:38PM +0300, Sagi Grimberg wrote:
> >>> Ok. I'll remove all uses of ib_get_dma_mr()...
> >>>
> >>>
> >>
> >> I meant that rdma_get_dma_mr can go away. I'd prefer to get the
> >> needed access_flags and just call existing verb.
> >
> > I strongly disagree.  As this series has shown the existing API is not
> > epressive enough for all transports.  It thus needs to go away and be
> > fully replaced by the new API introduced here.
> >
> 
> Christoph,
> 
> I wasn't arguing about having a transport independent API. I was
> referring to this wrapper specifically that trampolines to
> ib_get_dma_mr() with rdma_device_access_flags(pd, roles, attrs) helper.
> 
> The rdma_device_access_flags() itself is fine. However, given that
> this helper is used elsewhere as well, I don't see the point of having
> yet another helper specifically just for the dma_mr case that does
> nothing more than trampolines with a call to rdma_device_access_flags().
> 

I took the feedback from Christoph and Jason to mean I should remove ib_get_dma_mr() entirely and pull its guts into
rdma_get_dma_mr(), and change all the users of ib_get_dma_mr() to use rdma_get_dma_mr().  So the net result isn't a wrapper.  It
would of course still use rdma_device_access_flags()...

Steve. 


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 7, 2015, 2:17 p.m. UTC | #15
> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran
> Sent: Monday, July 06, 2015 2:09 AM
> To: Steve Wise; dledford@redhat.com
> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 06/07/2015 02:22, Steve Wise wrote:
> > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> > +{
> > +	int access_flags = attrs;
> > +
> > +	if (roles & RDMA_MRR_RECV)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +
> > +	if (roles & RDMA_MRR_WRITE_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> > +
> > +	if (roles & RDMA_MRR_READ_DEST) {
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> > +		if (rdma_protocol_iwarp(pd->device,
> > +					rdma_start_port(pd->device)))
> > +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> > +	}
> > +
> > +	if (roles & RDMA_MRR_READ_SOURCE)
> > +		access_flags |= IB_ACCESS_REMOTE_READ;
> > +
> > +	if (roles & RDMA_MRR_ATOMIC_DEST)
> > +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> 
> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the
> results of the atomic operation.
>

Where/how are the results returned?  In a recv completion?  If so, then that MR would need RDMA_MRR_RECV, not RDMA_MRR_ATOMIC_SOURCE.




--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Haggai Eran July 7, 2015, 2:34 p.m. UTC | #16
On 07/07/2015 17:17, Steve Wise wrote:
> 
> 
>> -----Original Message-----
>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran
>> Sent: Monday, July 06, 2015 2:09 AM
>> To: Steve Wise; dledford@redhat.com
>> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
>> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
>> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
>>
>> On 06/07/2015 02:22, Steve Wise wrote:
>>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
>>> +{
>>> +	int access_flags = attrs;
>>> +
>>> +	if (roles & RDMA_MRR_RECV)
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
>>> +
>>> +	if (roles & RDMA_MRR_WRITE_DEST)
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
>>> +
>>> +	if (roles & RDMA_MRR_READ_DEST) {
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
>>> +		if (rdma_protocol_iwarp(pd->device,
>>> +					rdma_start_port(pd->device)))
>>> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
>>> +	}
>>> +
>>> +	if (roles & RDMA_MRR_READ_SOURCE)
>>> +		access_flags |= IB_ACCESS_REMOTE_READ;
>>> +
>>> +	if (roles & RDMA_MRR_ATOMIC_DEST)
>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
>>
>> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the
>> results of the atomic operation.
>>
> 
> Where/how are the results returned?  In a recv completion?  If so, then that MR would need RDMA_MRR_RECV, not RDMA_MRR_ATOMIC_SOURCE.

They are returned in the scatter list provided in ib_send_wr.sg_list,
similarly to how RDMA read results are returned.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 7, 2015, 2:46 p.m. UTC | #17
> -----Original Message-----
> From: Haggai Eran [mailto:haggaie@mellanox.com]
> Sent: Tuesday, July 07, 2015 9:34 AM
> To: Steve Wise; dledford@redhat.com
> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On 07/07/2015 17:17, Steve Wise wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran
> >> Sent: Monday, July 06, 2015 2:09 AM
> >> To: Steve Wise; dledford@redhat.com
> >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
> >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
> >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> >>
> >> On 06/07/2015 02:22, Steve Wise wrote:
> >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
> >>> +{
> >>> +	int access_flags = attrs;
> >>> +
> >>> +	if (roles & RDMA_MRR_RECV)
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> >>> +
> >>> +	if (roles & RDMA_MRR_WRITE_DEST)
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
> >>> +
> >>> +	if (roles & RDMA_MRR_READ_DEST) {
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
> >>> +		if (rdma_protocol_iwarp(pd->device,
> >>> +					rdma_start_port(pd->device)))
> >>> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
> >>> +	}
> >>> +
> >>> +	if (roles & RDMA_MRR_READ_SOURCE)
> >>> +		access_flags |= IB_ACCESS_REMOTE_READ;
> >>> +
> >>> +	if (roles & RDMA_MRR_ATOMIC_DEST)
> >>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
> >>
> >> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the
> >> results of the atomic operation.
> >>
> >
> > Where/how are the results returned?  In a recv completion?  If so, then that MR would need RDMA_MRR_RECV, not
> RDMA_MRR_ATOMIC_SOURCE.
> 
> They are returned in the scatter list provided in ib_send_wr.sg_list,
> similarly to how RDMA read results are returned.

Ah.  Hmm.  I was confused about how the atomic operations worked.  

Is this correct:

ib_send_wr.wr.atomic.remote_addr : the peer's address that will be the target of the atomic operation.
ib_send_wr.wr.atomic.compare_add/compare_add_mask: the data to be used in the atomic compare-and-add on the target address
ib_send_wr.wr.atomic.swap/swap_mask: the data to be used in an atomic swap on the target address.
ib_send_wr.sg_list: results from the swap or compare/add.

Is the above correct? 

Maybe the two role names should be RDMA_MRR_ATOMIC_TARGET and RDMA_MRR_ATOMIC_RESULT?

Steve.

 

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Haggai Eran July 7, 2015, 3:07 p.m. UTC | #18
On 07/07/2015 17:46, Steve Wise wrote:
> 
> 
>> -----Original Message-----
>> From: Haggai Eran [mailto:haggaie@mellanox.com]
>> Sent: Tuesday, July 07, 2015 9:34 AM
>> To: Steve Wise; dledford@redhat.com
>> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
>> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
>> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
>>
>> On 07/07/2015 17:17, Steve Wise wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran
>>>> Sent: Monday, July 06, 2015 2:09 AM
>>>> To: Steve Wise; dledford@redhat.com
>>>> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-
>>>> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org
>>>> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
>>>>
>>>> On 06/07/2015 02:22, Steve Wise wrote:
>>>>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
>>>>> +{
>>>>> +	int access_flags = attrs;
>>>>> +
>>>>> +	if (roles & RDMA_MRR_RECV)
>>>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
>>>>> +
>>>>> +	if (roles & RDMA_MRR_WRITE_DEST)
>>>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
>>>>> +
>>>>> +	if (roles & RDMA_MRR_READ_DEST) {
>>>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE;
>>>>> +		if (rdma_protocol_iwarp(pd->device,
>>>>> +					rdma_start_port(pd->device)))
>>>>> +			access_flags |= IB_ACCESS_REMOTE_WRITE;
>>>>> +	}
>>>>> +
>>>>> +	if (roles & RDMA_MRR_READ_SOURCE)
>>>>> +		access_flags |= IB_ACCESS_REMOTE_READ;
>>>>> +
>>>>> +	if (roles & RDMA_MRR_ATOMIC_DEST)
>>>>> +		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
>>>>
>>>> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the
>>>> results of the atomic operation.
>>>>
>>>
>>> Where/how are the results returned?  In a recv completion?  If so, then that MR would need RDMA_MRR_RECV, not
>> RDMA_MRR_ATOMIC_SOURCE.
>>
>> They are returned in the scatter list provided in ib_send_wr.sg_list,
>> similarly to how RDMA read results are returned.
> 
> Ah.  Hmm.  I was confused about how the atomic operations worked.  
> 
> Is this correct:
> 
> ib_send_wr.wr.atomic.remote_addr : the peer's address that will be the target of the atomic operation.
> ib_send_wr.wr.atomic.compare_add/compare_add_mask: the data to be used in the atomic compare-and-add on the target address
Yes, in compare and swap, this is the value to compare the original
remote data with. In fetch and add, this would be the value to add.

> ib_send_wr.wr.atomic.swap/swap_mask: the data to be used in an atomic swap on the target address.
This is used in compare and swap as the data to write if the comparison
succeeds.

> ib_send_wr.sg_list: results from the swap or compare/add.
> 
> Is the above correct? 
Pretty much. See above.

> 
> Maybe the two role names should be RDMA_MRR_ATOMIC_TARGET and RDMA_MRR_ATOMIC_RESULT?
Sounds good.

Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 7, 2015, 4:17 p.m. UTC | #19
On Tue, Jul 07, 2015 at 09:05:15AM -0500, Steve Wise wrote:

> I took the feedback from Christoph and Jason to mean I should remove
> ib_get_dma_mr() entirely and pull its guts into rdma_get_dma_mr(),
> and change all the users of ib_get_dma_mr() to use
> rdma_get_dma_mr().  So the net result isn't a wrapper.  It would of
> course still use rdma_device_access_flags()...

Right, to the greatest extent possible.

Keeping ib_get_dma_mr and the old flags around just means someone
could go back and use the old flags.

I expect well need to keep the driver entry point for user space, but
we should be able to hide the API and flags from other modules.

A wrapper is a reasonable way to stage through that transition..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 7, 2015, 4:27 p.m. UTC | #20
On 7/7/2015 7:17 PM, Jason Gunthorpe wrote:
> On Tue, Jul 07, 2015 at 09:05:15AM -0500, Steve Wise wrote:
>
>> I took the feedback from Christoph and Jason to mean I should remove
>> ib_get_dma_mr() entirely and pull its guts into rdma_get_dma_mr(),
>> and change all the users of ib_get_dma_mr() to use
>> rdma_get_dma_mr().  So the net result isn't a wrapper.  It would of
>> course still use rdma_device_access_flags()...
>
> Right, to the greatest extent possible.
>
> Keeping ib_get_dma_mr and the old flags around just means someone
> could go back and use the old flags.

It looks strange to me that there is a helper to get a transport
independent access_flags rdma_device_access_flags() that helps the
user to choose the correct access_flags for MR creation and fast
registration (which is more than OK), and having a wrapper that just
hides it from the user...

Doesn't it look odd to you?

>
> I expect well need to keep the driver entry point for user space, but
> we should be able to hide the API and flags from other modules.

dma_mr for user-space?? How can this be relevant for user-space?

>
> A wrapper is a reasonable way to stage through that transition..

I can't say that I completely agree here.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 7, 2015, 9:36 p.m. UTC | #21
On Tue, Jul 07, 2015 at 07:27:47PM +0300, Sagi Grimberg wrote:

> Doesn't it look odd to you?

Sure, but the oddness is that rdma_device_access_flags exists at all,
not the wrapper. The wrapper is what we want the API to look like,
if we could trivially change the WR format as well then
rdma_device_access_flags wouldn't even exist at all.

> >I expect well need to keep the driver entry point for user space, but
> >we should be able to hide the API and flags from other modules.
> 
> dma_mr for user-space?? How can this be relevant for user-space?

I didn't mean dma_mr specifically, but all the other driver entry
points that use the old-style IBV_ACCESS_* flags.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 8, 2015, 7:29 a.m. UTC | #22
On 7/8/2015 12:36 AM, Jason Gunthorpe wrote:
> On Tue, Jul 07, 2015 at 07:27:47PM +0300, Sagi Grimberg wrote:
>
>> Doesn't it look odd to you?
>
> Sure, but the oddness is that rdma_device_access_flags exists at all,
> not the wrapper. The wrapper is what we want the API to look like,

I don't necessarily agree. The API we'd want is a single API at all
the call sites to all types of MRs. We have different QP types, and
still we don't have an allocation API for each and every one.
I honestly don't see why we have that for MRs.

If we can converge to a single API for MR allocation we can just get it
right once.

> if we could trivially change the WR format as well then
> rdma_device_access_flags wouldn't even exist at all.

I have taken some time to truly think about that following Christoph's
comments to my indirect registration patches. This is one of the things
I'm looking at.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 8, 2015, 8:11 a.m. UTC | #23
On Tue, Jul 07, 2015 at 09:05:15AM -0500, Steve Wise wrote:
> I took the feedback from Christoph and Jason to mean I should remove ib_get_dma_mr() entirely and pull its guts into
> rdma_get_dma_mr(), and change all the users of ib_get_dma_mr() to use rdma_get_dma_mr().  So the net result isn't a wrapper.  It
> would of course still use rdma_device_access_flags()...

That's what I'd prefer, yes.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 8, 2015, 8:13 a.m. UTC | #24
On Wed, Jul 08, 2015 at 10:29:56AM +0300, Sagi Grimberg wrote:
> I don't necessarily agree. The API we'd want is a single API at all
> the call sites to all types of MRs. We have different QP types, and
> still we don't have an allocation API for each and every one.
> I honestly don't see why we have that for MRs.
> 
> If we can converge to a single API for MR allocation we can just get it
> right once.

That sounds even better in the long run, but sorting out one problem at
a time seems like the easier way forward.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 8, 2015, 10:05 a.m. UTC | #25
On 7/8/2015 11:13 AM, 'Christoph Hellwig' wrote:
> On Wed, Jul 08, 2015 at 10:29:56AM +0300, Sagi Grimberg wrote:
>> I don't necessarily agree. The API we'd want is a single API at all
>> the call sites to all types of MRs. We have different QP types, and
>> still we don't have an allocation API for each and every one.
>> I honestly don't see why we have that for MRs.
>>
>> If we can converge to a single API for MR allocation we can just get it
>> right once.
>
> That sounds even better in the long run, but sorting out one problem at
> a time seems like the easier way forward.
>

If we agree to consolidate on a single MR allocation API, I don't see
how this wrapper is moving us forward. But if you guys prefer to have it
than I don't have a hard objection.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 8, 2015, 10:20 a.m. UTC | #26
On Wed, Jul 08, 2015 at 01:05:28PM +0300, Sagi Grimberg wrote:
> If we agree to consolidate on a single MR allocation API, I don't see
> how this wrapper is moving us forward. But if you guys prefer to have it
> than I don't have a hard objection.

Well, when are we going to get that MR allocation API?  Unless it's
ready to go in ASAP I'd rather take a first step in the right direction
instead of waiting.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 8, 2015, 11:08 a.m. UTC | #27
On 7/8/2015 1:20 PM, 'Christoph Hellwig' wrote:
> On Wed, Jul 08, 2015 at 01:05:28PM +0300, Sagi Grimberg wrote:
>> If we agree to consolidate on a single MR allocation API, I don't see
>> how this wrapper is moving us forward. But if you guys prefer to have it
>> than I don't have a hard objection.
>
> Well, when are we going to get that MR allocation API?  Unless it's
> ready to go in ASAP I'd rather take a first step in the right direction
> instead of waiting.
>

I am still not clear if all of us agree that we need it.
Sean and Steve had some disclaimers...
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean July 8, 2015, 5:14 p.m. UTC | #28
> I am still not clear if all of us agree that we need it.
> Sean and Steve had some disclaimers...

A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used.  We have a single entry point for post_send, for example, and that makes things worse.  IMO, we need fewer registration *methods* not fewer calls.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 8, 2015, 7:08 p.m. UTC | #29
On Wed, Jul 08, 2015 at 10:29:56AM +0300, Sagi Grimberg wrote:
> >Sure, but the oddness is that rdma_device_access_flags exists at all,
> >not the wrapper. The wrapper is what we want the API to look like,
> 
> I don't necessarily agree. The API we'd want is a single API at all
> the call sites to all types of MRs.

Well, I'm speaking specifically about the access protection flags
because that is all this patch set is working on.

What to do about the rest of the mess is a whole other problem,
and completely orthogonal to the access protection problem.

> We have different QP types, and still we don't have an allocation
> API for each and every one.  I honestly don't see why we have that
> for MRs.

The MR stuff was never really designed, the HW people provided some
capability and the SW side just raw exposed it, thoughtlessly.

> If we can converge to a single API for MR allocation we can just get it
> right once.

I'm not sure focusing on MR is the right layer, but who knows.

Every time I look at this, I just shake my head and go *WTF*?

Even something simple like issuing a SEND against local memory seems
horribly complicated. Why do we have local_dma_lkey and ib_get_dma_mr?

Why is code using iWarp calling ib_get_dma_mr with
RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure.

Why on earth is NFS using frmr to create RDMA READ lkeys?

The flags change is easy, and makes sense, but this whole thing looks
deeply rotten.

I'd strongly suggest a MR cleanup should look first at purging lkey
completely from the in-kernel API.

I think when you do that, it quickly becomes clear that iWarp's
problem is not a seemingly minor issue with different flag bits, but
that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer.
Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work
around. So iWarp (and only iWarp) should take the pain of spinning up
temporary local MRs for RDMA READ.

This should all be hidden under a common API and it shouldn't be
sprinkled all over the place in NFS, iSER, etc.

So, I'd say, first order of buisness to fix the MR mess would be to
purge lkey from the in-kernel API and rationalize the local side to
something cleaner.

Then, what is left is all remote MRs and maybe it will be clearer what
to do about them then...

> >if we could trivially change the WR format as well then
> >rdma_device_access_flags wouldn't even exist at all.
> 
> I have taken some time to truly think about that following Christoph's
> comments to my indirect registration patches. This is one of the things
> I'm looking at.

I really hope you can come up with something here, I'm sure you can
get some help on this issue, because all the storage ULPs are
suffering together under this awful set of APIs.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 8, 2015, 8:32 p.m. UTC | #30
On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote:
> Then, what is left is all remote MRs and maybe it will be clearer what
> to do about them then...

From looking at that for a while the APIs needed seem pretty simple
to me from a consumer perspective:

struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages);
void rdma_free_mr(struct rdma_mr *mr);

/* updates *sg if the SG couldn't be fully registered due to offsets */
int rdma_register_sg(struct rdma_mr *mr, struct scatterlist **sg,
		u32 *pkey, u32 *offset, u32 *len);
void rdma_unregister_sg(struct rdma_mr *mr, struct scatterlist *sg);

plus maybe a pool alloc API if we care about FMR performance.

Note that this assumes that the iSER bounce buffer hacks are replaced
with the QUEUE_FLAG_SG_GAPS flags and a change to the SG_IO code to
bounce buffer for vectored SG_IO calls.  I'm happy to provide a patch
for that if I can find a volunteer for testing.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 8, 2015, 8:37 p.m. UTC | #31
On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote:
> /* updates *sg if the SG couldn't be fully registered due to offsets */
> int rdma_register_sg(struct rdma_mr *mr, struct scatterlist **sg,
> 		u32 *pkey, u32 *offset, u32 *len);

plus an "enum dma_data_direction direction" of course.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 8, 2015, 9:38 p.m. UTC | #32
On 7/8/2015 3:08 PM, Jason Gunthorpe wrote:
> The MR stuff was never really designed, the HW people provided some
> capability and the SW side just raw exposed it, thoughtlessly.

Jason, I don't disagree that the API can be improved. I have some
responses to your statements below though.

> Why is code using iWarp calling ib_get_dma_mr with
> RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure.

Because the iWARP protocol requires it, which was very much an
intentional decision. It actually is not insecure, as discussed
in detail in RFC5042. However, it is different from Infiniband.

> Why on earth is NFS using frmr to create RDMA READ lkeys?

Because NFS desires to have a single code path that works for all
transports. In addition, using the IB dma_mr as an lkey means that
large I/O (common with NFS) would require multiple RDMA Read
operations, when the page list exceeded the local scatter/gather
limit of the device.

> I think when you do that, it quickly becomes clear that iWarp's
> problem is not a seemingly minor issue with different flag bits, but
> that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer.
> Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work
> around. So iWarp (and only iWarp) should take the pain of spinning up
> temporary local MRs for RDMA READ.

That is entirely the wrong layer to address this. It would prevent
reuse of MRs, and require upper layers to be aware that this was
the case - which is exactly the opposite of what you are trying
to achieve.

> This should all be hidden under a common API and it shouldn't be
> sprinkled all over the place in NFS, iSER, etc.

Are you arguing that all upper layer storage protocols take a single
common approach to memory registration? That is a very different
discussion.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 8, 2015, 11:36 p.m. UTC | #33
On Wed, Jul 08, 2015 at 05:38:05PM -0400, Tom Talpey wrote:
> On 7/8/2015 3:08 PM, Jason Gunthorpe wrote:
> >The MR stuff was never really designed, the HW people provided some
> >capability and the SW side just raw exposed it, thoughtlessly.
> 
> Jason, I don't disagree that the API can be improved. I have some
> responses to your statements below though.
> 
> >Why is code using iWarp calling ib_get_dma_mr with
> >RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure.
> 
> Because the iWARP protocol requires it, which was very much an
> intentional decision. It actually is not insecure, as discussed
> in detail in RFC5042. However, it is different from Infiniband.

You'll have to point me to the section on that..

ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is defined to create a remote
access MR to all of physical memory. That means there exists a packet
the far side can send that will write to any place in physical memory.

Yes, the far side has to guess the key, but there is no way you'll
convince me that is secure, and 6.3.4 supports that position.

   "The ULP must set the base and bounds of the buffer when the
    STag is initialized to expose only the data to be retrieved."

ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) directly contravenes that
requirement.

I assume you mean when it creates a FRMR it is secure that it uses
IB_ACCESS_REMOTE_WRITE and the invalidates the MR before accessing it
- and I agree with that.

> >Why on earth is NFS using frmr to create RDMA READ lkeys?
> 
> Because NFS desires to have a single code path that works for all
> transports. In addition, using the IB dma_mr as an lkey means that
> large I/O (common with NFS) would require multiple RDMA Read
> operations, when the page list exceeded the local scatter/gather
> limit of the device.

So NFS got overwhelmed by API complexity and used a scheme that
penalizes all hardware, in all cases, rather than fall back on MR only
when absolutely necessary. Not really a ringing endorsement of status
quo....

> >I think when you do that, it quickly becomes clear that iWarp's
> >problem is not a seemingly minor issue with different flag bits, but
> >that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer.
> >Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work
> >around. So iWarp (and only iWarp) should take the pain of spinning up
> >temporary local MRs for RDMA READ.
> 
> That is entirely the wrong layer to address this. It would prevent
> reuse of MRs, and require upper layers to be aware that this was
> the case - which is exactly the opposite of what you are trying
> to achieve.

How So?

I'd hope for an API that is transparent to the upper layer, that lets
the driver/core do a variety of things.

If a call needs to create a temporary local MR, then the API should be
structured so the driver/core can do that. If the MR type benefits
from re-use then the core/driver should internally pool and re-use the
MRs.

Give me a reason why NFS should care about any of this? All it is
doing is issuing RDMA READ's and expecting that data to land in local
memory.

Concretely, I'd imagine something like

cookie = rdma_post_read(sendq,local_sg,remote_sg,wrid);
[..]
if (wc->wrid == wrid)
  rdma_complete_read(sendq,cookie);

And the core/driver will RDMA READ the remote addresses in remote_sg
list into the local_sg, using the best available strategy, and it
doesn't have limits like only a few SG entries because 'best
available strategy' includes using temporary MRs and multiple SQWEs.

The driver will pick the 'best available strategy' that suits the HW
it is driving, not NFS/iSER/SRP/Lustre.

Same basic story for SEND, WRITE and RECV.

> >This should all be hidden under a common API and it shouldn't be
> >sprinkled all over the place in NFS, iSER, etc.
> 
> Are you arguing that all upper layer storage protocols take a single
> common approach to memory registration? That is a very different
> discussion.

I'm arguing upper layer protocols should never even see local memory
registration, that it is totally irrelevant to them. So yes, you can
call that a common approach to memory registration if you like..

Basically it appears there is nothing that NFS can do to optimize that
process that cannot be done in the driver/core equally effectively and
shared between all ULPs. If you see something otherwise, I'm really
interested to hear about it.

Even your case of the MR trade off for S/G list limitations - that is
a performance point NFS has no buisness choosing. The driver is best
placed to know when to switch between S/G lists, multiple RDMA READs
and MR. The trade off will shift depending on HW limits:
 - Old mthca hardware is probably better to use multiple RDMA READ
 - mlx4 is probably better to use FRMR
 - mlx5 is likely best with indirect MR
 - All of the above are likely best to exhaust the S/G list first
 
The same basic argument is true of WRITE, SEND and RECV. If the S/G
list is exhausted then the API should transparently build a local MR
to linearize the buffer, and the API should be designed so the core
code can do that without the ULP having to be involved in those
details.

Is it possible?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 9, 2015, 12:03 a.m. UTC | #34
On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote:
> On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote:
> > Then, what is left is all remote MRs and maybe it will be clearer what
> > to do about them then...
> 
> From looking at that for a while the APIs needed seem pretty simple
> to me from a consumer perspective:
>
> struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages);
> void rdma_free_mr(struct rdma_mr *mr);

The major trouble with that is that the new MR types work by posting
work to the send queue, that work creates the MR.

I don't know all the details of how those schemes work, but it doesn't
look like it fits into this model ?

Maybe if rdma_register_sg was the only API and it accepted a QP, under
the hood it could re-use a pooled/create a non-queued (F)MR, or issue
FRMR, or work with indirect?

It would be nice if it is that simple :)

Sagi, are FMRs, FRMRs and Indirect MRs documented any place? I can't
recall off hand if FRMR was ever published by the IBTA?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 9, 2015, 8 a.m. UTC | #35
On Wed, Jul 08, 2015 at 06:03:37PM -0600, Jason Gunthorpe wrote:
> The major trouble with that is that the new MR types work by posting
> work to the send queue, that work creates the MR.
> 
> I don't know all the details of how those schemes work, but it doesn't
> look like it fits into this model ?

It sorta does, but we do need a QP argument.  I have implemented this
abstraction in a driver, but it's still usign driver specific data
structures at this point, so extracting all the arguments is a bit
of a mess.  But if there is some interest I can try to polish it up
and port an existing driver or two to it.

Note that post_send isn't really used for alloating the MR structure
but performing the registration.

> Maybe if rdma_register_sg was the only API and it accepted a QP, under
> the hood it could re-use a pooled/create a non-queued (F)MR, or issue
> FRMR, or work with indirect?

I really want to avoid another allocator in my driver.  With the blk-mq
model we basically preallocate a request with all resoruces, so right
now I totally avoid allocations in I/O path, which is something I'd
like to keep if possible.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 9, 2015, 8:46 a.m. UTC | #36
On 7/8/2015 8:14 PM, Hefty, Sean wrote:
>> I am still not clear if all of us agree that we need it.
>> Sean and Steve had some disclaimers...
>
> A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used.

It is true that different MRs will be used differently. However, not
once we have found ourselves extending an API to add functionality. This
means changing the API signature and changing all the call sites. Just
recently we saw this in Steve's mr_roles and in Matan's timestamping
support (changed ib_create_cq). When was the last time ib_create_qp API
was modified?

> We have a single entry point for post_send, for example, and that makes things worse.

I don't necessarily agree. the fact that post_send have multiple entry
points allows the consumer to concatenate a couple of those in a single
post. That's beneficial to get maximum performance from your HW.

> IMO, we need fewer registration *methods* not fewer calls.

I do agree that we need fewer registration methods. Let's review what
we have today:

- FRWR: which is *the* standard way to register memory. It's fast,
   non-blocking and has vast support.

- FMR: which is only supported in some Mellanox devices if I'm not
   mistaken, it's fast, but has slow invalidation (unmap). It is also not
   supported over VF.
   * FMR_POOL API was designed to address the slow unmap but created a
     week invalidation semantics (unmap is done only when some number of
     remapping is met).

- REG_PHYS_MR: which is supported by some devices. It actually
   combines both MR allocation and registration in a single call (it is
   the equivalent of user-space ibv_reg_mr)

I don't consider the dma_mr a registration method. It provides physical
memory access.

As for REG_PHYS_MR, this is the only synchronous registration method in
the kernel. It is a simple interface, which is currently used by 
xprtrdma when dma mr is not supported. We can consider removing it in
the future if it turns out to be non useful.

As for FMR/FMR_POOL, I'd much rather to wait until it becomes
deprecated on it's own rather than try and incorporate it in a
modernized API.

I think the stack can converge on FRWR as its primary registration
method. Let's focus on making it better.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 9, 2015, 8:47 a.m. UTC | #37
On 7/8/2015 11:32 PM, 'Christoph Hellwig' wrote:
> On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote:
>> Then, what is left is all remote MRs and maybe it will be clearer what
>> to do about them then...
>
>  From looking at that for a while the APIs needed seem pretty simple
> to me from a consumer perspective:
>
> struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages);

Why assuming that the mr spans pages?
I'd prefer if we consolidate on an existing mr allocation API
ib_create_mr() that receives a generic struct ib_mr_init_attr.

> void rdma_free_mr(struct rdma_mr *mr);
>
> /* updates *sg if the SG couldn't be fully registered due to offsets */
> int rdma_register_sg(struct rdma_mr *mr, struct scatterlist **sg,
> 		u32 *pkey, u32 *offset, u32 *len);

This is where I have a problem. Providing an API that may or may not
post a work request on my QP is confusing, and I don't understand its
semantics at all. Do I need to reserve slots on my QP? should I ask for
a completion? If we suppress the completion will I see an error
completion? What should I expect to find in the wr_id?

The stack is converging on FRWR as the primary registration method. Any
other methods will gradually be removed. I think that posting the work
request must never be implicit. We should focus on making the
registration work request better and hide the mechanics from the
consumer.

P.S. you probably didn't mean the argument pkey unless I'm completely
missing something...

> void rdma_unregister_sg(struct rdma_mr *mr, struct scatterlist *sg);

Same arguments here for local invalidate work request.

> Note that this assumes that the iSER bounce buffer hacks are replaced
> with the QUEUE_FLAG_SG_GAPS flags and a change to the SG_IO code to
> bounce buffer for vectored SG_IO calls.

Yea - that's on my todo's...
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 9, 2015, 8:58 a.m. UTC | #38
On 7/9/2015 3:03 AM, Jason Gunthorpe wrote:
> On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote:
>> On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote:
>>> Then, what is left is all remote MRs and maybe it will be clearer what
>>> to do about them then...
>>
>>  From looking at that for a while the APIs needed seem pretty simple
>> to me from a consumer perspective:
>>
>> struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages);
>> void rdma_free_mr(struct rdma_mr *mr);
>
> The major trouble with that is that the new MR types work by posting
> work to the send queue, that work creates the MR.
>
> I don't know all the details of how those schemes work, but it doesn't
> look like it fits into this model ?
>
> Maybe if rdma_register_sg was the only API and it accepted a QP, under
> the hood it could re-use a pooled/create a non-queued (F)MR, or issue
> FRMR, or work with indirect?

I think it's confusing to possibly do completely different registration
methods which has different implications on the consumer RDMA channel.

>
> It would be nice if it is that simple :)

It can be simple if we consolidate on a single method.

>
> Sagi, are FMRs, FRMRs and Indirect MRs documented any place? I can't
> recall off hand if FRMR was ever published by the IBTA?

I don't know if FMRs are in IBTA.
FRMRs are in under 10.6.3.7 FAST REGISTRATION (vol 1)
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 9, 2015, 11:02 a.m. UTC | #39
On 7/9/2015 2:36 AM, Jason Gunthorpe wrote:

>
> I'm arguing upper layer protocols should never even see local memory
> registration, that it is totally irrelevant to them. So yes, you can
> call that a common approach to memory registration if you like..
>
> Basically it appears there is nothing that NFS can do to optimize that
> process that cannot be done in the driver/core equally effectively and
> shared between all ULPs. If you see something otherwise, I'm really
> interested to hear about it.
>
> Even your case of the MR trade off for S/G list limitations - that is
> a performance point NFS has no buisness choosing. The driver is best
> placed to know when to switch between S/G lists, multiple RDMA READs
> and MR. The trade off will shift depending on HW limits:
>   - Old mthca hardware is probably better to use multiple RDMA READ
>   - mlx4 is probably better to use FRMR
>   - mlx5 is likely best with indirect MR
>   - All of the above are likely best to exhaust the S/G list first
>
> The same basic argument is true of WRITE, SEND and RECV. If the S/G
> list is exhausted then the API should transparently build a local MR
> to linearize the buffer, and the API should be designed so the core
> code can do that without the ULP having to be involved in those
> details.
>
> Is it possible?
>
> Jason
>

Jason,

We have protocol that involves remote memory keys transfer in their
standards so I don't see how we can remove it altogether from ULPs.

Putting that aside,

My main problem with this approach is that once you do non-trivial
things such as memory registration completely under the hood, it is
a slippery slope for device drivers.

If say a driver decides to register memory without the caller knowing,
it would need to post an extra work request on the send queue. So once
it sees the completion, it needs to silently consume it and have some
non trivial logic to invalidate it (another work request!) either from
poll_cq context or another thread.

Moreover, this also means that the driver needs to allocate bigger send
queues for possible future memory registration (depending on the IO
pattern maybe). And I really don't like an API that instructs the user
"please allocate some extra room in your send queue as I might need it".

This would also require the drivers to take a huristic approach on how
much memory registration resources are needed for all possible
consumers (ipoib, sdp, srp, iser, nfs, more...) which might have
different requirements.

I know that these are implementation details, but the point is that
vendor drivers can easily become a complete mess. I think we should
try to find a balanced approach where both consumers and providers are
not completely messed up.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever July 9, 2015, 1:52 p.m. UTC | #40
On Jul 9, 2015, at 4:46 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> On 7/8/2015 8:14 PM, Hefty, Sean wrote:
>>> I am still not clear if all of us agree that we need it.
>>> Sean and Steve had some disclaimers...
>> 
>> A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used.
> 
> It is true that different MRs will be used differently. However, not
> once we have found ourselves extending an API to add functionality. This
> means changing the API signature and changing all the call sites. Just
> recently we saw this in Steve's mr_roles and in Matan's timestamping
> support (changed ib_create_cq). When was the last time ib_create_qp API
> was modified?
> 
>> We have a single entry point for post_send, for example, and that makes things worse.
> 
> I don't necessarily agree. the fact that post_send have multiple entry
> points allows the consumer to concatenate a couple of those in a single
> post. That's beneficial to get maximum performance from your HW.
> 
>> IMO, we need fewer registration *methods* not fewer calls.
> 
> I do agree that we need fewer registration methods.

I also feel that would be a healthy direction.


> Let's review what we have today:
> 
> - FRWR: which is *the* standard way to register memory. It's fast,
>  non-blocking and has vast support.
> 
> - FMR: which is only supported in some Mellanox devices if I'm not
>  mistaken, it's fast, but has slow invalidation (unmap). It is also not
>  supported over VF.
>  * FMR_POOL API was designed to address the slow unmap but created a
>    week invalidation semantics (unmap is done only when some number of
>    remapping is met).
> 
> - REG_PHYS_MR: which is supported by some devices. It actually
>  combines both MR allocation and registration in a single call (it is
>  the equivalent of user-space ibv_reg_mr)

There is also Memory Windows, but there may no longer be any
kernel consumers of that memory registration method.


> I don't consider the dma_mr a registration method. It provides physical
> memory access.
> 
> As for REG_PHYS_MR, this is the only synchronous registration method in
> the kernel. It is a simple interface, which is currently used by xprtrdma when dma mr is not supported. We can consider removing it in
> the future if it turns out to be non useful.

Code review has shown the remaining ib_reg_phys_mr() call in xprtrdma
is never reached in the current code, and thus it will be removed
very soon.

There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.


> As for FMR/FMR_POOL, I'd much rather to wait until it becomes
> deprecated on it's own rather than try and incorporate it in a
> modernized API.

> I think the stack can converge on FRWR as its primary registration
> method. Let's focus on making it better.


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 9, 2015, 5:01 p.m. UTC | #41
On Thu, Jul 09, 2015 at 02:02:03PM +0300, Sagi Grimberg wrote:

> We have protocol that involves remote memory keys transfer in their
> standards so I don't see how we can remove it altogether from ULPs.

This is why I've been talking about local and remote MRs
differently. A Local MR is one where the Key is never put on the wire,
it exists soley to facilitate DMA between the CPU and the local HCA,
and it would never be needed if we had infinitely long S/G lists.

> My main problem with this approach is that once you do non-trivial
> things such as memory registration completely under the hood, it is
> a slippery slope for device drivers.

Yes, there is going to be some stuff going on, but the simplification
for the ULP side is incredible, it is certainly something that should
be explored and not dismissed without some really good reasons.

> If say a driver decides to register memory without the caller knowing,
> it would need to post an extra work request on the send queue.

Yes, the first issue is how to do flow control the sendq.

But this seems easily solved, every ULP is already tracking the
number of available entries in the senq, and it will not start new ops
until there is space, so instead of doing the computation internally
on how much space is needed to do X, we factor it out:

   if (rdma_sqes_post_read(...) < avail_sqe)
     avail_sqe -= rdma_post_read(...)
   else
     // Try again after completions advance

Every new-style post op is paired with a 'how many entires do I need'
call.

This is not a new concept, a ULP working with FRMR already has to know
it cannot start a FRMR using OP unless there is 2 SQE's
available. (and it has to make all this conditional on if it using
FRMR or something else). All this is doing is shifting the computation
of '2' out of the ULP and into the driver.

> So once it sees the completion, it needs to silently consume it and
> have some non trivial logic to invalidate it (another work request!)
> either from poll_cq context or another thread.

Completions are driven by the ULP. Every new-style post also has a
completion entry point. The ULP calls it when it knows the op has
done, either because the WRID it provided has signaled completed, or
because a later op has completed (signalling was supressed).

Since that may also be an implicitly posting API (if necessary, is
it?), it follows the same rules as above. This isn't changing
anything. ULPs would already have to drive invalidate posts from
completion with flow control, we are just moving the actul post
construction and computation of the needed SQEs out of the ULP.

> This would also require the drivers to take a huristic approach on how
> much memory registration resources are needed for all possible
> consumers (ipoib, sdp, srp, iser, nfs, more...) which might have
> different requirements.

That doesn't seem like a big issue. The ULP can give a hint on the PD
or QP what sort of usage it expects. 'Up to 16 RDMA READS' 'Up to 1MB
transfer per RDMA' and the core can use a pre-allocated pool scheme.

I was thinking about a pre-allocation for local here, as Christoph
suggests, I think that is a refinement we could certainly add on, once
there is some clear idea what allocations are acutally necessary to
spin up a temp MR. The basic issue I'd see is that the preallocation
would be done without knowledge of the desired SG list, but maybe some
kind of canonical 'max' SG could be used as a stand in...

Put together, it would look like this:
   if (rdma_sqes_post_read(...) < avail_sqe)
         avail_sqe -= rdma_post_read(qp,...,read_wrid)
  [.. fetch wcs ...]
   if (wc.wrid == read_wrid)
        if (rdma_sqes_post_complete_read(...,read_wrid) < avail_sqe)
	      rdma_post_complete_read(qp,...,read_wrid);
        else
	      // queue read_wrid for later rdma_post_complete_read

I'm not really seeing anything here that screams out this is
impossible, or performance is impacted, or it is too messy on either
the ULP or driver side.

Laid out like this, I think it even means we can nuke the IB DMA API
for these cases. rdma_post_read and rdma_post_complete_read are the
two points that need dma api calls (cache flushes), and they can just
do them internally.

This also tells me that the above call sites must already exist in
every ULP, so we, again, are not really substantially changing
core control flow for the ULP.

Are there more details that wreck the world?

Just to break it down:
  - rdma_sqes_post_read figures out how many SQEs are needed to post
    the specified RDMA READ.
      On IB, if the SG list can be used then this is always 1.
      If the RDMA READ is split into N RDMA READS then it is N.
      For something like iWarp this would be (?)
        * FRMR SQE
	* RDMA READ SQE
	* FRMR Invalidate (signaled)

      Presumably we can squeeze FMR and so forth into this scheme as
      well? They don't seem to use SQE's so it is looks simpler..

      Perhaps if an internal MR pool is exhausted this returns 0xFFFF
      and the caller will do a completion cycle, which may provide
      free MR's back to the pool. Ultimately once the SQ and CQ are
      totally drained the pool should be back to 100%?
  - rdma_post_read generates the necessary number of posts.
    The SQ must have the right number of entires available
    (see above)
  - rdma_post_complete_read is doing any clean up posts to make a MR
    ready to go again. Perhaps this isn't even posting?

    Semantically, I'd want to see rdma_post_complete_read returning to
    mean that the local read buffer is ready to go, and the ULP can
    start using it instantly. All invalidation is complete and all
    CPU caches are sync'd.

    This is where we'd start the recycling process for any temp MR,
    whatever that means..

I expect all these calls would be function pointers, and each driver
would provide a function pointer that is optimal for it's use. Eg mlx4
would provide a pointer that used the S/G list, then falls back to
FRMR if the S/G list is exhausted. The core code would provide a
toolbox of common functions the drivers can use here.

I didn't explore how errors work, but, I think, errors are just a
labeling exercise:
  if (wc is error && wc.wrid == read_wrid
     rdma_error_complete_read(...,read_wrid,wc)

Error recovery blows up the QP, so we just need to book keep and get
the MRs accounted for. The driver could do a synchronous clean up of
whatever mess is left during the next create_qp, or on the PD destroy.

> I know that these are implementation details, but the point is that
> vendor drivers can easily become a complete mess. I think we should
> try to find a balanced approach where both consumers and providers are
> not completely messed up.

Sure, but today vendor drivers and the core is trivial while ULPs are
an absolute mess.

Goal #1 should be to move all the mess into the API and support all
the existing methods. We should try as hard as possible to do that,
and if along the way, it is just isn't possible, then fine. But that
should be the first thing we try to reach for.

Just tidying FRMR so it unifies with indirect, is a fine consolation
prize, but I believe we can do better.

To your point in another message, I'd say, as long as the new API
supports FRMR at full speed with no performance penalty we are
good. If the other variants out there take a performance hit, then I
think that is OK. As you say, they are on the way out, we just need to
make sure that ULPs continue to work with FMR with the new API so
legacy cards don't completely break.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 9, 2015, 8 p.m. UTC | #42
On 7/9/2015 1:01 PM, Jason Gunthorpe wrote:
> Laid out like this, I think it even means we can nuke the IB DMA API
> for these cases. rdma_post_read and rdma_post_complete_read are the
> two points that need dma api calls (cache flushes), and they can just
> do them internally.
>
> This also tells me that the above call sites must already exist in
> every ULP, so we, again, are not really substantially changing
> core control flow for the ULP.
>
> Are there more details that wreck the world?

Two things come to mind - PD's, and virtualization.

If there's no ib_get_dma_mr() call, what PD does the region get?
One could argue it inherits the QP's (Emulex proposed such a
PD-less MR in this year's OFS Developer's Workshop). But this
could impose new conditions on ULP's; they would have to be
aware of this affinity and it could affect their QP use.

More importantly, if a guest can post FRWR work requests with
physical addresses, what enforces their validity? The dma_mr
provides a PD but it also allows the host partition to interpose
on the call, setting up an IOMMU mapping, creating a new NIC TPT
mapping, etc. Without this, it may be possible for hostile guest
to forge FRMR's and attack the host, or other guests.

 > I didn't explore how errors work, but, I think, errors are just a
 > labeling exercise:
 >   if (wc is error && wc.wrid == read_wrid
 >      rdma_error_complete_read(...,read_wrid,wc)
 >
 > Error recovery blows up the QP, so we just need to book keep and get
 > the MRs accounted for. The driver could do a synchronous clean up of
 > whatever mess is left during the next create_qp, or on the PD destroy.

This is a subtle area. If the driver posts silenced completions as
you describe, there may not be a wc to reap. So either the driver or
the ULP will need to post a sentinel, the completion of which indicates
any prior silenced operations have actually done so. This can be
hard to get right. And if the driver posts everything signaled, well,
performance at high IOPS will be a challenge. The ULP is much better
positioned to manage that.

I'm with you on the flow control, btw. It's a new rule for the
ULP to obey, but probably not too onerous. Remember though, verbs
today return EAGAIN when the queue you're posting is full (a
terrible choice IMO). So upper layers don't actually need to
count WR's, unless they want to.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 9, 2015, 9:16 p.m. UTC | #43
On Thu, Jul 09, 2015 at 04:00:37PM -0400, Tom Talpey wrote:
> On 7/9/2015 1:01 PM, Jason Gunthorpe wrote:
> >Laid out like this, I think it even means we can nuke the IB DMA API
> >for these cases. rdma_post_read and rdma_post_complete_read are the
> >two points that need dma api calls (cache flushes), and they can just
> >do them internally.
> >
> >This also tells me that the above call sites must already exist in
> >every ULP, so we, again, are not really substantially changing
> >core control flow for the ULP.
> >
> >Are there more details that wreck the world?
> 
> Two things come to mind - PD's, and virtualization.
> 
> If there's no ib_get_dma_mr() call, what PD does the region get?
> One could argue it inherits the QP's (Emulex proposed such a
> PD-less MR in this year's OFS Developer's Workshop). But this
> could impose new conditions on ULP's; they would have to be
> aware of this affinity and it could affect their QP use.

You'll have to educate me here. Why does the in-kernel ULP care about
what PD is used for lkey MRs?

This is a real question - every ULP in kernel seems to create a single
PD per device and thats it. (though NFS creates two, I assume that is
some odd internal split, not actually functional)

So, if there is a reason to have multiple PD's, nobody has done
anything with it in the last decade?

Thus, I think, for use on a lkey MR, it doesn't matter what PD is
used, as long as the adaptor will execute the request. If the driver
needs a PD, then using the one from the QP seems like it would cause
no problems.

> More importantly, if a guest can post FRWR work requests with
> physical addresses, what enforces their validity? The dma_mr
> provides a PD but it also allows the host partition to interpose
> on the call, setting up an IOMMU mapping, creating a new NIC TPT
> mapping, etc. Without this, it may be possible for hostile guest
> to forge FRMR's and attack the host, or other guests.

The MR concept doesn't go away, it just stops being exposed to the ULP.

Every PD would implictly create a ib_get_dma_mr that can handle all
local access. This isn't even a change, every ULP that creates a PD
also immediately creates a ib_get_dma_mr, I'm just moving that into
the alloc_pd call so the ULP never sees it.

FRMR doesn't change, other than the WR is created in the core/driver
layer, not the ULP. Whatever is put in that WR isn't going to change,
if it is secure today, then it will still be secure tomorrow.

Temporary MRs mean that rmda_post_read needs to be called in a context
where it can hypercall to create one, if the driver needs. I don't
think this is any different from the requirements the DMA API has? So
nothing changes. Not exactly sure what ULPs do here, but I'm expecting
they do this call before posting. Do any thread it?

Sagi's comment on indirect registrations seems like future
virtualization schemes will use a SQE post like FRMR to do this, but I
don't know the details.

So, nothing really seems to change here. What am I missing?

> > I didn't explore how errors work, but, I think, errors are just a
> > labeling exercise:
> >   if (wc is error && wc.wrid == read_wrid
> >      rdma_error_complete_read(...,read_wrid,wc)
> >
> > Error recovery blows up the QP, so we just need to book keep and get
> > the MRs accounted for. The driver could do a synchronous clean up of
> > whatever mess is left during the next create_qp, or on the PD destroy.
> 
> This is a subtle area. If the driver posts silenced completions as
> you describe, there may not be a wc to reap. So either the driver or
> the ULP will need to post a sentinel, the completion of which indicates
> any prior silenced operations have actually done so. This can be
> hard to get right. And if the driver posts everything signaled, well,
> performance at high IOPS will be a challenge. The ULP is much better
> positioned to manage that.

But again, nothing really changes. The ULP calls rdma_post_read, and
like every other ib_post_send type thing, it will ask for signaled
completion or not.

If yes, then the last WR in the chain gets marked, otherwise none are
marked. No different than what a ULP would do today when it builds WRs
manually.

The trick is, every call to rdma_post_read *MUST* be paired with a
call to either rdma_error_complete_read, or rdma_post_complete_read
with the wrid from above.

If the ULP is using supressed completion, then it must bookkeep this
stuff, and if it sees a completion for WRID+10, it must go through and
call rdma_post_complete_read for WRID+0,+1,+2, ...

Error unwind is similar..

> I'm with you on the flow control, btw. It's a new rule for the
> ULP to obey, but probably not too onerous. Remember though, verbs
> today return EAGAIN when the queue you're posting is full (a
> terrible choice IMO). So upper layers don't actually need to
> count WR's, unless they want to.

Yes, EAGAIN is ugly, but we cannot trivially support that at this
proposed API level. The issue is idempotency, if the core/driver needs
to push 3 SGEs on, and only 2 make it, then WTF does the ULP do?

To support EAGAIN, the core/drivers needs to support an all or nothing
ib_post_send, which is beyond the driver capability we have today...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford July 9, 2015, 10:18 p.m. UTC | #44
On 07/08/2015 08:03 PM, Jason Gunthorpe wrote:
> On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote:
>> On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote:
>>> Then, what is left is all remote MRs and maybe it will be clearer what
>>> to do about them then...
>>
>> From looking at that for a while the APIs needed seem pretty simple
>> to me from a consumer perspective:
>>
>> struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages);
>> void rdma_free_mr(struct rdma_mr *mr);
> 
> The major trouble with that is that the new MR types work by posting
> work to the send queue, that work creates the MR.
> 
> I don't know all the details of how those schemes work, but it doesn't
> look like it fits into this model ?
> 
> Maybe if rdma_register_sg was the only API and it accepted a QP, under
> the hood it could re-use a pooled/create a non-queued (F)MR, or issue
> FRMR, or work with indirect?
> 
> It would be nice if it is that simple :)
> 
> Sagi, are FMRs, FRMRs and Indirect MRs documented any place? I can't
> recall off hand if FRMR was ever published by the IBTA?

A lot of this discussion is interesting and has gone off in an area that
I think is worth pursuing in the long run.  However, in the short run,
this patch was a minor cleanup/simplification patch.  These discussions
are moving into complete redesigns and rearchitecting.  Steve, I'm OK
with the cleanup and would prefer that it stay separate from these
larger issues.

Baby steps, one things at a time.
Jason Gunthorpe July 9, 2015, 10:53 p.m. UTC | #45
On Thu, Jul 09, 2015 at 06:18:26PM -0400, Doug Ledford wrote:

> A lot of this discussion is interesting and has gone off in an area that
> I think is worth pursuing in the long run.  However, in the short run,
> this patch was a minor cleanup/simplification patch.  These discussions
> are moving into complete redesigns and rearchitecting.  Steve, I'm OK
> with the cleanup and would prefer that it stay separate from these
> larger issues.

So, I was originally of the view the flags change was fine.

But when I realized that it basically hides a
ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) behind an opaque API:

 rdma_get_dma_mr(RDMA_MRR_READ_DEST)

I changed my mind.

For iWarp the above creates a rkey that can RDMA WRITE to any place in
system memory. If a remote guesses that rkey your system is totally
compromised. So it is insecure, and contrary to the advice in RFC5042.

Seeing rdma_get_dma_mr(RDMA_MRR_READ_DEST) added all over the code
base terrifies me, because it is utterly wrong, and looks harmless.

The mistep, is that enabling iSER for iWarp is not just this trivial
change:

-	device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE);
+	mr_roles = RDMA_MRR_RECV | RDMA_MRR_SEND | RDMA_MRR_WRITE_SOURCE |
+		   RDMA_MRR_READ_DEST;
+	device->mr = rdma_get_dma_mr(device->pd, mr_roles, 0);

But, it must also follow the path of NFS and use FRMR on iWarp for
RDMA READ lkeys. This is the real quirk of iWarp, not that the MR
flags are slightly different.

From there, it is a logical wish: If Steve is going to FRMR'ize iSER
to be acceptable security wise, I'd rather see that work done on in a
general way. Hence the API discussion.

What do you think?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 10, 2015, 1:22 p.m. UTC | #46
On 7/9/2015 6:53 PM, Jason Gunthorpe wrote:
> On Thu, Jul 09, 2015 at 06:18:26PM -0400, Doug Ledford wrote:
>
>> A lot of this discussion is interesting and has gone off in an area that
>> I think is worth pursuing in the long run.  However, in the short run,
>> this patch was a minor cleanup/simplification patch.  These discussions
>> are moving into complete redesigns and rearchitecting.  Steve, I'm OK
>> with the cleanup and would prefer that it stay separate from these
>> larger issues.
>
> So, I was originally of the view the flags change was fine.
>
> But when I realized that it basically hides a
> ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) behind an opaque API:
>
>   rdma_get_dma_mr(RDMA_MRR_READ_DEST)
>
> I changed my mind.

Yes, I agree. Creating a writable privileged rmr for the RDMA Read sink
buffer exposes all of physical memory to a simple arithmetic mistake.
This should never be allowed in a production config.

That said, for the use by the NFS/RDMA Server, the risk is significantly
mitigated. The sink MR is never sent to the peer by the NFS upper layer,
it is protected by the connection's PD, and it is enabled only when the
RDMA Read is active. However, the very real risk remains. This should
definitely not be the standard API.

> For iWarp the above creates a rkey that can RDMA WRITE to any place in
> system memory. If a remote guesses that rkey your system is totally
> compromised. So it is insecure, and contrary to the advice in RFC5042.
>
> Seeing rdma_get_dma_mr(RDMA_MRR_READ_DEST) added all over the code
> base terrifies me, because it is utterly wrong, and looks harmless.
>
> The mistep, is that enabling iSER for iWarp is not just this trivial
> change:
>
> -	device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE);
> +	mr_roles = RDMA_MRR_RECV | RDMA_MRR_SEND | RDMA_MRR_WRITE_SOURCE |
> +		   RDMA_MRR_READ_DEST;
> +	device->mr = rdma_get_dma_mr(device->pd, mr_roles, 0);
>

Yep. It's the only safe, correct solution.


 > But, it must also follow the path of NFS and use FRMR on iWarp for
 > RDMA READ lkeys. This is the real quirk of iWarp, not that the MR
 > flags are slightly different.
 >
>  From there, it is a logical wish: If Steve is going to FRMR'ize iSER
> to be acceptable security wise, I'd rather see that work done on in a
> general way. Hence the API discussion.

Well, it's important to realize that NFS already does the Right Thing.
So it isn't actually an urgent issue. There is time to discuss.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 10, 2015, 4:11 p.m. UTC | #47
On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:
> and it is enabled only when the RDMA Read is active.

??? How is that done? ib_get_dma_mr is defined to return a remote
usable rkey that is valid from when it it returns until the MR is
destroyed. NFS creates one mr early on, it does not seem to make one
per RDMA READ?

For the proposed iSER patch the problem is very acute, iser makes a
single PD and phys MR at boot time for each device. This means there
is a single machine wide unchanging rkey that allows remote physical
memory access. An attacker only has to repeatedly open QPs to iSER and
guess rkey values until they find it. Add in likely non-crypto
randomness for rkeys, and I bet it isn't even that hard to do.

So this seems to be a catastrophic security failing.

> > From there, it is a logical wish: If Steve is going to FRMR'ize iSER
> >to be acceptable security wise, I'd rather see that work done on in a
> >general way. Hence the API discussion.
> 
> Well, it's important to realize that NFS already does the Right Thing.
> So it isn't actually an urgent issue. There is time to discuss.

It depends on what is going on with iWarp iSER. If the iWarp community
thinks it should go ahead insecurely then fine, put a giant warning in
dmesg and go ahead, otherwise iWarp needs to address this difference
with a proper generic API, which appears, must wrapper
ib_post_send. Harder job :\

I'm sorry Steve for leading you down the wrong path with these flags,
I did not fully realize exactly what the iWarp difference was at the
start :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford July 10, 2015, 5:56 p.m. UTC | #48
On 07/10/2015 12:11 PM, Jason Gunthorpe wrote:
> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:
>> and it is enabled only when the RDMA Read is active.
> 
> ??? How is that done? ib_get_dma_mr is defined to return a remote
> usable rkey that is valid from when it it returns until the MR is
> destroyed. NFS creates one mr early on, it does not seem to make one
> per RDMA READ?
> 
> For the proposed iSER patch the problem is very acute, iser makes a
> single PD and phys MR at boot time for each device. This means there
> is a single machine wide unchanging rkey that allows remote physical
> memory access. An attacker only has to repeatedly open QPs to iSER and
> guess rkey values until they find it. Add in likely non-crypto
> randomness for rkeys, and I bet it isn't even that hard to do.
> 
> So this seems to be a catastrophic security failing.

As I recall, didn't the NFSoRDMA stack do even worse in the beginning
(meaning it previously had a memory model that was simply "register all
memory on the client and pass the rkey to the server and then tell the
server when/where to read/write", which was subsequently removed in
Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)?

I'm not bringing this up to downplay the current iSER issue, but to
demonstrate that we have long had a different trust model than the one
in RFC5042.  More below on that.

>>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER
>>> to be acceptable security wise, I'd rather see that work done on in a
>>> general way. Hence the API discussion.
>>
>> Well, it's important to realize that NFS already does the Right Thing.
>> So it isn't actually an urgent issue. There is time to discuss.
> 
> It depends on what is going on with iWarp iSER. If the iWarp community
> thinks it should go ahead insecurely then fine, put a giant warning in
> dmesg and go ahead, otherwise iWarp needs to address this difference
> with a proper generic API, which appears, must wrapper
> ib_post_send. Harder job :\
> 
> I'm sorry Steve for leading you down the wrong path with these flags,
> I did not fully realize exactly what the iWarp difference was at the
> start :(

Are there security issues?  Yes.  Are we going to solve them in this
patch set?  No.  Especially since those security issues extend beyond
iSER + iWARP.

There are currently 4 RDMA capable link types/protocols.  Of those 4, 3
enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot
be sent out into the wild internet).  RFC5042 is written with iWARP in
mind because it *can* be sent out over the Internet and therefore it is
possible for someone to place an RNIC card directly on the Internet and
it must therefore deal with all of the attack vectors this entails.

The linux RDMA stack, as a whole, is no where near RFC5042 compliant.
And it isn't just the kernel space at fault here.  Early versions of
opensm used to require that the master and slave machines must have
passwordless root ssh access to each other to copy around the guid2lid
file so when the master failed over, the slave would have an up to date
list.

Because 3 of the 4 RDMA technology types enforce localized clusters, the
majority of the stack and the implementations have utilized the cluster
as the point of trust.  In truth, I think it's simply been the easy way
to do things.  RFC5042 is very clearly a non-trust model where each
machine, and each connection, is assumed to be a threat.

So, long story short(ish):

While this discussion is taking place in the cleanup thread, part of it
involves the preceding patch set: iSER support for iWARP.  Patch 4/5 in
the iSER/iWARP series enables the remote write on device->mr
registration which constitutes the security issue we've been discussing.
 This cleanup patch set, then, is merely guilty of hiding that security
issue, not introducing it.

I think Steve and the other iSER folks should consider the wisdom of
including the iWARP support as it stands.  If they think that the common
use case is in a sequestered cluster, and that the current trust domain
is already not RFC5042 compliant, then I'm OK with enabling iSER support
for iWARP, but I would want there to be a very clear warning in the
dmesg about the trust domain issue.  I might even suggest that the iSER
driver, at least on iWARP connections, could possibly check the src/dst
addresses to make sure that they are in a private IP domain and if they
aren't, refuse to create a connection while emitting an additional
warning about the trust domain.

As far as the cleanup patches are concerned, I'm OK with those too.
However, since we would have added a large warning in the iSER code
about this issue, and that would then go away with the cleanup, the
warning would need to be generalized (such as "module <foo> is
requesting insecure iWARP mappings"), and appropriately WARN_ONCEd on
probably a per module basis or something like that.

From there, we can start to look at the bigger picture of cleanup up the
default trust domain in the kernel and user space both (and soliciting
feedback on that...I have a suspicion that some users will not like us
tightening up security as it might interfere with their usage in their
sequestered clusters).
Chuck Lever July 10, 2015, 6:34 p.m. UTC | #49
On Jul 10, 2015, at 1:56 PM, Doug Ledford <dledford@redhat.com> wrote:

> On 07/10/2015 12:11 PM, Jason Gunthorpe wrote:
>> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:
>>> and it is enabled only when the RDMA Read is active.
>> 
>> ??? How is that done? ib_get_dma_mr is defined to return a remote
>> usable rkey that is valid from when it it returns until the MR is
>> destroyed. NFS creates one mr early on, it does not seem to make one
>> per RDMA READ?
>> 
>> For the proposed iSER patch the problem is very acute, iser makes a
>> single PD and phys MR at boot time for each device. This means there
>> is a single machine wide unchanging rkey that allows remote physical
>> memory access. An attacker only has to repeatedly open QPs to iSER and
>> guess rkey values until they find it. Add in likely non-crypto
>> randomness for rkeys, and I bet it isn't even that hard to do.
>> 
>> So this seems to be a catastrophic security failing.
> 
> As I recall, didn't the NFSoRDMA stack do even worse in the beginning
> (meaning it previously had a memory model that was simply "register all
> memory on the client and pass the rkey to the server and then tell the
> server when/where to read/write", which was subsequently removed in
> Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)?

This registration strategy was not removed. It is available and used
when the underlying HCA does not support either FRWR or FMR, or it
can be selected via sysctl.

I chose to leave it in for testing purposes. I know Christoph would
like to see it removed because of the security issues you mention
above. Others have asked that it be retained.

Certainly I can remove it as a fallback choice, leaving it so that
it can only be selected by sysctl.


> I'm not bringing this up to downplay the current iSER issue, but to
> demonstrate that we have long had a different trust model than the one
> in RFC5042.  More below on that.
> 
>>>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER
>>>> to be acceptable security wise, I'd rather see that work done on in a
>>>> general way. Hence the API discussion.
>>> 
>>> Well, it's important to realize that NFS already does the Right Thing.
>>> So it isn't actually an urgent issue. There is time to discuss.
>> 
>> It depends on what is going on with iWarp iSER. If the iWarp community
>> thinks it should go ahead insecurely then fine, put a giant warning in
>> dmesg and go ahead, otherwise iWarp needs to address this difference
>> with a proper generic API, which appears, must wrapper
>> ib_post_send. Harder job :\
>> 
>> I'm sorry Steve for leading you down the wrong path with these flags,
>> I did not fully realize exactly what the iWarp difference was at the
>> start :(
> 
> Are there security issues?  Yes.  Are we going to solve them in this
> patch set?  No.  Especially since those security issues extend beyond
> iSER + iWARP.
> 
> There are currently 4 RDMA capable link types/protocols.  Of those 4, 3
> enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot
> be sent out into the wild internet).  RFC5042 is written with iWARP in
> mind because it *can* be sent out over the Internet and therefore it is
> possible for someone to place an RNIC card directly on the Internet and
> it must therefore deal with all of the attack vectors this entails.
> 
> The linux RDMA stack, as a whole, is no where near RFC5042 compliant.
> And it isn't just the kernel space at fault here.  Early versions of
> opensm used to require that the master and slave machines must have
> passwordless root ssh access to each other to copy around the guid2lid
> file so when the master failed over, the slave would have an up to date
> list.
> 
> Because 3 of the 4 RDMA technology types enforce localized clusters, the
> majority of the stack and the implementations have utilized the cluster
> as the point of trust.  In truth, I think it's simply been the easy way
> to do things.  RFC5042 is very clearly a non-trust model where each
> machine, and each connection, is assumed to be a threat.
> 
> So, long story short(ish):
> 
> While this discussion is taking place in the cleanup thread, part of it
> involves the preceding patch set: iSER support for iWARP.  Patch 4/5 in
> the iSER/iWARP series enables the remote write on device->mr
> registration which constitutes the security issue we've been discussing.
> This cleanup patch set, then, is merely guilty of hiding that security
> issue, not introducing it.
> 
> I think Steve and the other iSER folks should consider the wisdom of
> including the iWARP support as it stands.  If they think that the common
> use case is in a sequestered cluster, and that the current trust domain
> is already not RFC5042 compliant, then I'm OK with enabling iSER support
> for iWARP, but I would want there to be a very clear warning in the
> dmesg about the trust domain issue.  I might even suggest that the iSER
> driver, at least on iWARP connections, could possibly check the src/dst
> addresses to make sure that they are in a private IP domain and if they
> aren't, refuse to create a connection while emitting an additional
> warning about the trust domain.
> 
> As far as the cleanup patches are concerned, I'm OK with those too.
> However, since we would have added a large warning in the iSER code
> about this issue, and that would then go away with the cleanup, the
> warning would need to be generalized (such as "module <foo> is
> requesting insecure iWARP mappings"), and appropriately WARN_ONCEd on
> probably a per module basis or something like that.
> 
> From there, we can start to look at the bigger picture of cleanup up the
> default trust domain in the kernel and user space both (and soliciting
> feedback on that...I have a suspicion that some users will not like us
> tightening up security as it might interfere with their usage in their
> sequestered clusters).
> 
> -- 
> Doug Ledford <dledford@redhat.com>
>              GPG KeyID: 0E572FDD
> 
> 

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 10, 2015, 6:42 p.m. UTC | #50
On 7/10/2015 1:56 PM, Doug Ledford wrote:
> On 07/10/2015 12:11 PM, Jason Gunthorpe wrote:
>> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:
>>> and it is enabled only when the RDMA Read is active.
>>
>> ??? How is that done? ib_get_dma_mr is defined to return a remote
>> usable rkey that is valid from when it it returns until the MR is
>> destroyed. NFS creates one mr early on, it does not seem to make one
>> per RDMA READ?

Actually you're right - the NFS server never tears down the MR. So, the
old code is just as vulnerable as the new code.

>> For the proposed iSER patch the problem is very acute, iser makes a
>> single PD and phys MR at boot time for each device. This means there
>> is a single machine wide unchanging rkey that allows remote physical
>> memory access. An attacker only has to repeatedly open QPs to iSER and
>> guess rkey values until they find it. Add in likely non-crypto
>> randomness for rkeys, and I bet it isn't even that hard to do.

The rkeys have a PD, wich cannot be forged, so it's not a matter of
attacking, but it is most definitely a system integrity risk, as I
mentioned earlier, a simple arithmetic offset mistake can overwrite
anything.

>> So this seems to be a catastrophic security failing.

s/security/integrity/ and I agree.

> As I recall, didn't the NFSoRDMA stack do even worse in the beginning
> (meaning it previously had a memory model that was simply "register all
> memory on the client and pass the rkey to the server and then tell the
> server when/where to read/write", which was subsequently removed in
> Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)?

Yes, Doug, shamefully it did, but

a) That was the client and Jason and I are talking about the server
b) When that code was written (2007), FRMR did not exist, and other
    memory registration methods had abysmal performance
c) It was heavily deprecated with console warnings emitted
d) Chuck did away with it since then.

> I'm not bringing this up to downplay the current iSER issue, but to
> demonstrate that we have long had a different trust model than the one
> in RFC5042.  More below on that.

I'll say something after requoting you.

>>>>  From there, it is a logical wish: If Steve is going to FRMR'ize iSER
>>>> to be acceptable security wise, I'd rather see that work done on in a
>>>> general way. Hence the API discussion.
>>>
>>> Well, it's important to realize that NFS already does the Right Thing.
>>> So it isn't actually an urgent issue. There is time to discuss.
>>
>> It depends on what is going on with iWarp iSER. If the iWarp community
>> thinks it should go ahead insecurely then fine, put a giant warning in
>> dmesg and go ahead, otherwise iWarp needs to address this difference
>> with a proper generic API, which appears, must wrapper
>> ib_post_send. Harder job :\
>>
>> I'm sorry Steve for leading you down the wrong path with these flags,
>> I did not fully realize exactly what the iWarp difference was at the
>> start :(
>
> Are there security issues?  Yes.  Are we going to solve them in this
> patch set?  No.  Especially since those security issues extend beyond
> iSER + iWARP.
>
> There are currently 4 RDMA capable link types/protocols.  Of those 4, 3
> enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot
> be sent out into the wild internet).  RFC5042 is written with iWARP in
> mind because it *can* be sent out over the Internet and therefore it is
> possible for someone to place an RNIC card directly on the Internet and
> it must therefore deal with all of the attack vectors this entails.
>
> The linux RDMA stack, as a whole, is no where near RFC5042 compliant.
> And it isn't just the kernel space at fault here.  Early versions of
> opensm used to require that the master and slave machines must have
> passwordless root ssh access to each other to copy around the guid2lid
> file so when the master failed over, the slave would have an up to date
> list.
>
> Because 3 of the 4 RDMA technology types enforce localized clusters, the
> majority of the stack and the implementations have utilized the cluster
> as the point of trust.  In truth, I think it's simply been the easy way
> to do things.  RFC5042 is very clearly a non-trust model where each
> machine, and each connection, is assumed to be a threat.
>
> So, long story short(ish):
>
> While this discussion is taking place in the cleanup thread, part of it
> involves the preceding patch set: iSER support for iWARP.  Patch 4/5 in
> the iSER/iWARP series enables the remote write on device->mr
> registration which constitutes the security issue we've been discussing.
>   This cleanup patch set, then, is merely guilty of hiding that security
> issue, not introducing it.
>
> I think Steve and the other iSER folks should consider the wisdom of
> including the iWARP support as it stands.  If they think that the common
> use case is in a sequestered cluster, and that the current trust domain
> is already not RFC5042 compliant, then I'm OK with enabling iSER support
> for iWARP, but I would want there to be a very clear warning in the
> dmesg about the trust domain issue.  I might even suggest that the iSER
> driver, at least on iWARP connections, could possibly check the src/dst
> addresses to make sure that they are in a private IP domain and if they
> aren't, refuse to create a connection while emitting an additional
> warning about the trust domain.
>
> As far as the cleanup patches are concerned, I'm OK with those too.
> However, since we would have added a large warning in the iSER code
> about this issue, and that would then go away with the cleanup, the
> warning would need to be generalized (such as "module <foo> is
> requesting insecure iWARP mappings"), and appropriately WARN_ONCEd on
> probably a per module basis or something like that.
>
>  From there, we can start to look at the bigger picture of cleanup up the
> default trust domain in the kernel and user space both (and soliciting
> feedback on that...I have a suspicion that some users will not like us
> tightening up security as it might interfere with their usage in their
> sequestered clusters).

All excellent points. It's not worse, and it adds important transport
support.

However, it's an extremely bad idea to codify writable privileged rmr's
in the API as best practice. So under no circumstance should that become
the long term plan.

Tom.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 10, 2015, 7:34 p.m. UTC | #51
On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote:
> There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.

It's in the staging tree, which proper in-tree code doesn't have to
cater for.  So as soon as sunrpc is done using the interface we can and
should kill it off.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 10, 2015, 8:57 p.m. UTC | #52
On Fri, Jul 10, 2015 at 01:56:36PM -0400, Doug Ledford wrote:

> Are there security issues?  Yes.  Are we going to solve them in this
> patch set?  No.  Especially since those security issues extend beyond
> iSER + iWARP.

I think my biggest concern is we don't inadvertently open a security
hole on a machine that just happens to have an iwarp card installed,
but has nothing to do with HPC.

The clearest scary path I see is a server listening on a QP and using
IB_ACCESS_REMOTE_WRITE. That sure looks easy to exploit.

A client doing this.. It is alot harder to exploit.. iSER is client
only, so less worrying. Can anyone else think of a way to attack the
client?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford July 10, 2015, 10:27 p.m. UTC | #53
On 07/10/2015 04:57 PM, Jason Gunthorpe wrote:
> On Fri, Jul 10, 2015 at 01:56:36PM -0400, Doug Ledford wrote:
> 
>> Are there security issues?  Yes.  Are we going to solve them in this
>> patch set?  No.  Especially since those security issues extend beyond
>> iSER + iWARP.
> 
> I think my biggest concern is we don't inadvertently open a security
> hole on a machine that just happens to have an iwarp card installed,
> but has nothing to do with HPC.

Given the number of Chelsio cards utilized in things like TCP Offload
and iSCSI offload situations versus HPC, this isn't an unreasonable
thing to consider.  However, the attack vector is limited to a situation
where all of the below are true:

1) An RDMA connection exists or can be created (TOE and iSCSI offload do
not do this, so something else would have to be listening and accepting
incoming RDMA connections)
2) A global rkey exists (so it's not sufficient that any old RDMA app be
running, at least one app/ULP *must* create the global writable rkey)
3) The QP of the RDMA connection belongs to the same PD as the global
rkey (so our attack surface is limited to the bad server app/ULP and the
existence of this does not put other apps/ULPs at risk)
4) For maximum damage, the global rkey must also belong to an app/ULP
with elevated privilege or direct memory access (kernel ULPs are prime
targets, as a user app would have a mapped address space and the global
rkey in its domain would apply to it's process space, limiting the
damage that can be done).

Given all these requirements, I don't see this as a possibility.  In
order to be at risk, there *must* be a listening app/ULP, and we should
be able to print out a nice, dire warning in dmesg if that app/ULP opens
itself up in this way.

> The clearest scary path I see is a server listening on a QP and using
> IB_ACCESS_REMOTE_WRITE. That sure looks easy to exploit.

This goes back to the trust domain.  For a server, you simply can't
allow untrusted clients.  Period.  But that's easy enough to do with
access controls and connection filtering.  The app/ULP itself doesn't
even need to be filter aware as you can do the filtering in the TCP
stack on the primary listening socket using the netfilter tools.  And
that goes back to the kernel warning I referred to in my previous email.
 Let users know what module is making this risky decision and it becomes
easy to filter any ports that module's services are listening on.

> A client doing this.. It is alot harder to exploit.. iSER is client
> only, so less worrying. Can anyone else think of a way to attack the
> client?

TCP injection only is all I've got.
Doug Ledford July 10, 2015, 10:30 p.m. UTC | #54
On 07/10/2015 02:42 PM, Tom Talpey wrote:
> However, it's an extremely bad idea to codify writable privileged rmr's
> in the API as best practice. So under no circumstance should that become
> the long term plan.

Agree 100%.  Which is why I requested a big warning in the dmesg output
that pointed a big, accusatory finger at the offending module with the
long term goal of reforming the memory registration models to a method
that eliminates this problem in the kernel.
Doug Ledford July 11, 2015, 3:10 a.m. UTC | #55
On 07/10/2015 07:34 PM, Jason Gunthorpe wrote:
> On Fri, Jul 10, 2015 at 06:27:59PM -0400, Doug Ledford wrote:
> 
>> 1) An RDMA connection exists or can be created (TOE and iSCSI offload do
>> not do this, so something else would have to be listening and accepting
>> incoming RDMA connections)
>> 2) A global rkey exists (so it's not sufficient that any old RDMA app be
>> running, at least one app/ULP *must* create the global writable rkey)
>> 3) The QP of the RDMA connection belongs to the same PD as the global
>> rkey (so our attack surface is limited to the bad server app/ULP and the
>> existence of this does not put other apps/ULPs at risk)
>> 4) For maximum damage, the global rkey must also belong to an app/ULP
>> with elevated privilege or direct memory access (kernel ULPs are prime
>> targets, as a user app would have a mapped address space and the global
>> rkey in its domain would apply to it's process space, limiting the
>> damage that can be done).
> 
> This list looks right to me.

Good, we're in agreement so far ;-)

>> Given all these requirements, I don't see this as a possibility.  In
> 
> Hmm. So, if I put an iWarp card in a machine, boot any distro, and add
> something to /etc/exports ..
> 
> Does the NFS RDMA kernel module load and start a listening QP?

That depends on the OS.

> If not, that actually sounds like a bug. What if a distro fixes that?

Red Hat requires that you enable NFSoRDMA.  But, your point is valid.
That goes back to my point about the update patch set being very verbose
about the dangers of this.  The very obvious kernel message is helpful
in this case.

>>> The clearest scary path I see is a server listening on a QP and using
>>> IB_ACCESS_REMOTE_WRITE. That sure looks easy to exploit.
>>
>> This goes back to the trust domain.  For a server, you simply can't
>> allow untrusted clients.  Period.
> 
> This feels like an antiquated security model. Most things these days
> are using in-band mutual auth and then switch to a trusted mode. Both
> NFS and iSCSI support that (kerberos,chap,etc), I assume that generic
> support extends to RDMA.
> 
> If you use auth then trust as a security model, this rkey buisness is
> a problem, because you should not be vunerable to attack before
> authing.

Yes.  This is an antiquated security model.  There is a reason for that.
 It is easier to go simple to begin with and then add the extra layers
needed to keep things safe once you have the basics right.

>> access controls and connection filtering.  The app/ULP itself doesn't
>> even need to be filter aware as you can do the filtering in the TCP
>> stack on the primary listening socket using the netfilter tools.
> 
> Does netfilter work for iWarp? I'm surprised to hear that.

iWARP requires a normal TCP socket to connect to, then the client must
initiate an RDMA transfer, then a new connection is opened for the RDMA
transfers.  Blocking the parent dst:port/*:* will prevent these
connections.  If you are referring to allowing an untrusted client in
TCP mode but blocking them in RDMA mode, that's more complex and
requires app/ULP support.

>> that goes back to the kernel warning I referred to in my previous email.
>>  Let users know what module is making this risky decision and it becomes
>> easy to filter any ports that module's services are listening on.
> 
> That's fine, as long as the user is opting it this situation. If I'm
> not using RDMA but have an iWarp card, I should never see the
> message, even if RDMA support is autoloaded by my distro...

See above, an unexpected vulnerability caused by this model should
result in a very obvious kernel message.

>>> A client doing this.. It is alot harder to exploit.. iSER is client
>>> only, so less worrying. Can anyone else think of a way to attack the
>>> client?
>>
>> TCP injection only is all I've got.
> 
> Black hat server can also do it, and since it is possible before auth
> it is doable without the server auth credential.

Black hat server is beyond the scope of this discussion.
Christoph Hellwig July 11, 2015, 10:25 a.m. UTC | #56
On Thu, Jul 09, 2015 at 11:01:42AM -0600, Jason Gunthorpe wrote:
> To your point in another message, I'd say, as long as the new API
> supports FRMR at full speed with no performance penalty we are
> good. If the other variants out there take a performance hit, then I
> think that is OK. As you say, they are on the way out, we just need to
> make sure that ULPs continue to work with FMR with the new API so
> legacy cards don't completely break.

I think what we need to support for now is FRMR as the primary target,
and FMR as a secondard.  A few in-kernel drivers support physical
registrations, but it's always been clear that they've been a bad idea,
and the current discussion reconfirms that again.

So a FRMR-like API that allows to use FMRs underneath might be a good
start.  If the imput to that API are S/G lists as in my suggestion we'll
get indirect registration support almost for free.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 11, 2015, 4:37 p.m. UTC | #57
> On Jul 10, 2015, at 9:11 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
>> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:
>> and it is enabled only when the RDMA Read is active.
> 
> ??? How is that done? ib_get_dma_mr is defined to return a remote
> usable rkey that is valid from when it it returns until the MR is
> destroyed. NFS creates one mr early on, it does not seem to make one
> per RDMA READ?
> 
> For the proposed iSER patch the problem is very acute, iser makes a
> single PD and phys MR at boot time for each device. This means there
> is a single machine wide unchanging rkey that allows remote physical
> memory access. An attacker only has to repeatedly open QPs to iSER and
> guess rkey values until they find it. Add in likely non-crypto
> randomness for rkeys, and I bet it isn't even that hard to do.
> 
> So this seems to be a catastrophic security failing.
> 
>>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER
>>> to be acceptable security wise, I'd rather see that work done on in a
>>> general way. Hence the API discussion.
>> 
>> Well, it's important to realize that NFS already does the Right Thing.
>> So it isn't actually an urgent issue. There is time to discuss.
> 
> It depends on what is going on with iWarp iSER. If the iWarp community
> thinks it should go ahead insecurely then fine, put a giant warning in
> dmesg and go ahead, otherwise iWarp needs to address this difference
> with a proper generic API, which appears, must wrapper
> ib_post_send. Harder job :\
> 
> I'm sorry Steve for leading you down the wrong path with these flags,
> I did not fully realize exactly what the iWarp difference was at the
> start :(
> 
> Jason


No problem.  I'll work on iSER target FRMRs and repost the iSER series.  

Sagi, right now isert only uses FRMRs along with signature mrs.  I'll need to separate the two, I think.  Does that sound right?--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 12, 2015, 7:49 a.m. UTC | #58
On 7/10/2015 10:34 PM, Christoph Hellwig wrote:
> On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote:
>> There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.
>
> It's in the staging tree, which proper in-tree code doesn't have to
> cater for.  So as soon as sunrpc is done using the interface we can and
> should kill it off.
>

I think we should probably ask the Lustre folks if they have a real use
case for it before we remove it completely.

It seems that Lustre strives to FMRs and if it is not supported it uses
a PHYS_MR. Given FMRs has an expiration date and PHYS_MR is basically
alloc/free of MRs in the data path (which is obviously not desirable in
any form). Lustre will need to refresh their code to use the standard
FRWR.

Do we have Lustre folks listening on the mailing list?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 12, 2015, 10:46 a.m. UTC | #59
On 7/11/2015 7:37 PM, Steve Wise wrote:
>
>> On Jul 10, 2015, at 9:11 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>>
>>> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:
>>> and it is enabled only when the RDMA Read is active.
>>
>> ??? How is that done? ib_get_dma_mr is defined to return a remote
>> usable rkey that is valid from when it it returns until the MR is
>> destroyed. NFS creates one mr early on, it does not seem to make one
>> per RDMA READ?
>>
>> For the proposed iSER patch the problem is very acute, iser makes a
>> single PD and phys MR at boot time for each device. This means there
>> is a single machine wide unchanging rkey that allows remote physical
>> memory access. An attacker only has to repeatedly open QPs to iSER and
>> guess rkey values until they find it. Add in likely non-crypto
>> randomness for rkeys, and I bet it isn't even that hard to do.
>>
>> So this seems to be a catastrophic security failing.
>>
>>>>  From there, it is a logical wish: If Steve is going to FRMR'ize iSER
>>>> to be acceptable security wise, I'd rather see that work done on in a
>>>> general way. Hence the API discussion.
>>>
>>> Well, it's important to realize that NFS already does the Right Thing.
>>> So it isn't actually an urgent issue. There is time to discuss.
>>
>> It depends on what is going on with iWarp iSER. If the iWarp community
>> thinks it should go ahead insecurely then fine, put a giant warning in
>> dmesg and go ahead, otherwise iWarp needs to address this difference
>> with a proper generic API, which appears, must wrapper
>> ib_post_send. Harder job :\
>>
>> I'm sorry Steve for leading you down the wrong path with these flags,
>> I did not fully realize exactly what the iWarp difference was at the
>> start :(
>>
>> Jason
>
>
> No problem.  I'll work on iSER target FRMRs and repost the iSER series.
>
> Sagi, right now isert only uses FRMRs along with signature mrs.  I'll need to separate the two, I think.  Does that sound right?

Yea.

Given that FRWR takes extra HW (and memory) resources, it
should probably be:

if (signature support || iwarp)
	use FRMR

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 13, 2015, 4:35 p.m. UTC | #60
On Sat, Jul 11, 2015 at 03:25:38AM -0700, 'Christoph Hellwig' wrote:

> So a FRMR-like API that allows to use FMRs underneath might be a good
> start.  If the imput to that API are S/G lists as in my suggestion we'll
> get indirect registration support almost for free.

That would be an excellent outcome in my view.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 13, 2015, 4:50 p.m. UTC | #61
On Sun, Jul 12, 2015 at 10:49:08AM +0300, Sagi Grimberg wrote:
> On 7/10/2015 10:34 PM, Christoph Hellwig wrote:
> >On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote:
> >>There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.
> >
> >It's in the staging tree, which proper in-tree code doesn't have to
> >cater for.  So as soon as sunrpc is done using the interface we can and
> >should kill it off.
> >
> 
> I think we should probably ask the Lustre folks if they have a real use
> case for it before we remove it completely.

I had an interesting conversation with some Lustre devs where they
were very concerned that FMR was going to be removed and they didn't really
seem to even know that FRMR existed.

I'm sure their PHYS_MR usage is crufty old code to support old
adaptors.

And again, here is a great example of how the API is not helping the
ULPs do what they want to do. Lustre doesn't care one bit about this
stuff, they just want to send messages..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 13, 2015, 5:18 p.m. UTC | #62
On Fri, Jul 10, 2015 at 11:10:23PM -0400, Doug Ledford wrote:

> >> access controls and connection filtering.  The app/ULP itself doesn't
> >> even need to be filter aware as you can do the filtering in the TCP
> >> stack on the primary listening socket using the netfilter tools.
> > 
> > Does netfilter work for iWarp? I'm surprised to hear that.
> 
> iWARP requires a normal TCP socket to connect to, then the client must
> initiate an RDMA transfer, then a new connection is opened for the RDMA
> transfers.  Blocking the parent dst:port/*:* will prevent these
> connections.  If you are referring to allowing an untrusted client in
> TCP mode but blocking them in RDMA mode, that's more complex and
> requires app/ULP support.

Yes, that would be the use case here. If someone wishes to deploy
auth-then-trust in TCP mode with NFS/iSCSI (which is a kernel
supported TCP/IP mode) we need to be absolutely certain there is no
way for anything untrusted to pivot a connection into a RDMA mode and
exploit the RKEY problem.

Out of the box this must be impossible. The surest way to guarentee
that is to have this hack off by default.

> Black hat server is beyond the scope of this discussion.

We cannot assume an all-trusted model here, there are many
configurations to deploy NFS/iSCSI that don't assume that. Even if you
assume it for the RDMA cases (which I stronlgy disagree with), it
still must be proven to not weaken the existing TCP/IP case.

So, a black hat server is on the table, attacking a client that the
admin is not intending to use with RDMA, by forcing it to switch to
RDMA before auth and exploiting the RDMA side.

This is where the iwarp guys have to analyze and come back to say it
is OK. Maybe iwarp can't get to rdma without auth or something...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 13, 2015, 7:36 p.m. UTC | #63
On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
> I think what we need to support for now is FRMR as the primary target,
> and FMR as a secondar[y].

FMR is a *very* bad choice, for several reasons.

1) It's not supported by very many devices, in fact it might even
be thought to be obsolete.

2) It does not protect on byte boundaries, therefore it is unsafe
to use for anything but page-sized, page-aligned RDMA operations.

3) It is a synchronous API, i.e. it is not work-request based and
therefore is not very high performance.

4) It sometimes is used with a "pool", which defers deregistration
in the hopes of amortizing overhead. However, this deferral further
increases the risk of remote access, including altering of memory
contents after the fact.

Personally, I'd recommend ib_get_phys_mr() over FMR. It at least
doesn't suffer from issues 1, 2 and 4.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 13, 2015, 8:15 p.m. UTC | #64
On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:
> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
> >I think what we need to support for now is FRMR as the primary target,
> >and FMR as a secondar[y].
> 
> FMR is a *very* bad choice, for several reasons.

If an API can transparently support FMR, then I think it can also
transparently support ib_get_phys_mr as an alternative, they look
pretty similar... ?

> Personally, I'd recommend ib_get_phys_mr() over FMR. It at least
> doesn't suffer from issues 1, 2 and 4.

Your comments are right for the rkey case, but for lkey, there is no
security concern with using a FMR, or pooling them. It doesn't look
like any iwarp drivers supports FMR, so they are certainly safe to use
on IB as the lkey.

This is why I am becoming more convinced that treating lkey
and rkey the same is not helpful.

Conversely, it looks like if we could drop ehca and mthca we could
ditch FMR entirely..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 13, 2015, 10:23 p.m. UTC | #65
On 7/13/2015 1:18 PM, Jason Gunthorpe wrote:
> On Fri, Jul 10, 2015 at 11:10:23PM -0400, Doug Ledford wrote:
>> Black hat server is beyond the scope of this discussion.
>
> We cannot assume an all-trusted model here, there are many
> configurations to deploy NFS/iSCSI that don't assume that. Even if you
> assume it for the RDMA cases (which I stronlgy disagree with), it
> still must be proven to not weaken the existing TCP/IP case.
>
> So, a black hat server is on the table, attacking a client that the
> admin is not intending to use with RDMA, by forcing it to switch to
> RDMA before auth and exploiting the RDMA side.
>
> This is where the iwarp guys have to analyze and come back to say it
> is OK. Maybe iwarp can't get to rdma without auth or something...

Two observations.

One, auth is an Upper Layer matter. It's not the job of the transport
to authenticate the peer, the user, etc. Upper layers do this, and
iSCSI performs a login, NFSv4.1+ creates a session, SMB3 creates
sessions on multiple connections, etc. All this happens after the
connection is established.

Two, the iSCSI and NFSv4.1 protocols have explicit support for iWARP
"step-up" mode, which supports an initial connection in TCP (i.e.
with RDMA disabled), and switching to RDMA mode dynamically.

The IB and RoCE protocols do not support step-up mode, so in fact
one could argue that iWARP is *better* positioned to support this.

Tom.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 14, 2015, 7:37 a.m. UTC | #66
On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:
> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
> >I think what we need to support for now is FRMR as the primary target,
> >and FMR as a secondar[y].
> 
> FMR is a *very* bad choice, for several reasons.
> 
> 1) It's not supported by very many devices, in fact it might even
> be thought to be obsolete.

It's support by the Mellanox adapters, which might be a minority of
the drivers, but at least in the field I've been in the aboslute
majority of the deployed hardware.

It's the default in the SRP initiator and iSER initiators, and the
primary fallback in the NFS client if FRs aren't available.

I don't claim to be an expert on memory registrations models, as they
are horribly, horrible documents (baiscally not at all in the source
tree), so my knowledge is from looking at and using implementations as
well as this useful writeup from the NFS folks:

http://wiki.linux-nfs.org/wiki/index.php/NfsRdmaClient/MemRegModes

Which misses or downplays an important restriction of PHYS
registrations: They dont allow coalescing non-contiguous memory
into a single maping, which makes them totally unsuitable for iSER
which only allows a single rkey/offset/len pair and requires
additional RDMA READ/WRITE roundtrips and larger S/G lists for other
protocols.

Based on looking at the consumers and the above table I think FMR
are still the better fallback for devices that don't support FR,
but supporting PHYS MRs is easy enough that adding it to a common
layer seems easy enough if some cares enough to regularly test it.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 14, 2015, 8:06 a.m. UTC | #67
On 7/13/2015 7:50 PM, Jason Gunthorpe wrote:
> On Sun, Jul 12, 2015 at 10:49:08AM +0300, Sagi Grimberg wrote:
>> On 7/10/2015 10:34 PM, Christoph Hellwig wrote:
>>> On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote:
>>>> There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.
>>>
>>> It's in the staging tree, which proper in-tree code doesn't have to
>>> cater for.  So as soon as sunrpc is done using the interface we can and
>>> should kill it off.
>>>
>>
>> I think we should probably ask the Lustre folks if they have a real use
>> case for it before we remove it completely.
>
> I had an interesting conversation with some Lustre devs where they
> were very concerned that FMR was going to be removed and they didn't really
> seem to even know that FRMR existed.
>
> I'm sure their PHYS_MR usage is crufty old code to support old
> adaptors.
>
> And again, here is a great example of how the API is not helping the
> ULPs do what they want to do. Lustre doesn't care one bit about this
> stuff, they just want to send messages..

All protocols cares about transferring data and sending messages, so
it's not a good enough reason for a poor registration method choice.
This just emphasizes why we need to converge to a single method.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 14, 2015, 9:10 a.m. UTC | #68
On 7/13/2015 11:15 PM, Jason Gunthorpe wrote:
> On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:
>> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
>>> I think what we need to support for now is FRMR as the primary target,
>>> and FMR as a secondar[y].
>>
>> FMR is a *very* bad choice, for several reasons.
>
> If an API can transparently support FMR, then I think it can also
> transparently support ib_get_phys_mr as an alternative, they look
> pretty similar... ?

Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP
PoV. If you expose an API that might schedule (PHYS_MR) it limits the
context that the caller is allowed to call in.

I'm 100% against an registration API that *might* schedule.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 14, 2015, 9:22 a.m. UTC | #69
On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote:
> On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:
>> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
>>> I think what we need to support for now is FRMR as the primary target,
>>> and FMR as a secondar[y].
>>
>> FMR is a *very* bad choice, for several reasons.
>>
>> 1) It's not supported by very many devices, in fact it might even
>> be thought to be obsolete.
>
> It's support by the Mellanox adapters,

*Older* Mellanox adapters. mlx5 (and future drivers I assume) will
no longer support FMRs.

> which might be a minority of
> the drivers, but at least in the field I've been in the aboslute
> majority of the deployed hardware.
>
> It's the default in the SRP initiator and iSER initiators, and the
> primary fallback in the NFS client if FRs aren't available.

We should gradually move away from that.

>
> I don't claim to be an expert on memory registrations models, as they
> are horribly, horrible documents (baiscally not at all in the source
> tree), so my knowledge is from looking at and using implementations as
> well as this useful writeup from the NFS folks:
>
> http://wiki.linux-nfs.org/wiki/index.php/NfsRdmaClient/MemRegModes
>
> Which misses or downplays an important restriction of PHYS
> registrations: They dont allow coalescing non-contiguous memory
> into a single maping, which makes them totally unsuitable for iSER
> which only allows a single rkey/offset/len pair and requires
> additional RDMA READ/WRITE roundtrips and larger S/G lists for other
> protocols.

I'm willing to add a proper memory_registration.txt under
Documentation/infiniband/ which would capture the items that
were brought up here.

>
> Based on looking at the consumers and the above table I think FMR
> are still the better fallback for devices that don't support FR,

It's better if you want it fast. I can't stress it enough, but IMO, the
fallback should *not* be in the API, but rather in the ULP.
Ideally, at some point it won't need to fall back, and we can remove
the API.

> but supporting PHYS MRs is easy enough that adding it to a common
> layer seems easy enough if some cares enough to regularly test it.
>

Someone needs a good reason to use it.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 14, 2015, 12:12 p.m. UTC | #70
On 7/14/2015 5:22 AM, Sagi Grimberg wrote:
> On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote:
>> On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:
>>> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
>>>> I think what we need to support for now is FRMR as the primary target,
>>>> and FMR as a secondar[y].
>>>
>>> FMR is a *very* bad choice, for several reasons.
>>>
>>> 1) It's not supported by very many devices, in fact it might even
>>> be thought to be obsolete.
>>
>> It's support by the Mellanox adapters,
>
> *Older* Mellanox adapters. mlx5 (and future drivers I assume) will
> no longer support FMRs.

Right, but drop the word "will". Mlx5 (the current ConnectX-4) doesn't
support FMR. It's only supported by mlx4 and mthca drivers.

>> Based on looking at the consumers and the above table I think FMR
>> are still the better fallback for devices that don't support FR,
>
> It's better if you want it fast.

Do you guys think FMR is actually "fast"? Because it's not. Measure it.
It might have been marginally faster than other schemes in like, 2007,
and only when the mempool was there to leave the registrations open.
Don't go back there.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 14, 2015, 12:24 p.m. UTC | #71
On 7/14/2015 4:06 AM, Sagi Grimberg wrote:
> All protocols cares about transferring data and sending messages, so
> it's not a good enough reason for a poor registration method choice.
> This just emphasizes why we need to converge to a single method.

In my opinion, we already have it.

For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not
quite equivalent, btw.

For remote registrations, ib_post_send(FRMR).

Unfortunately, there exist ancient adapters in-tree that don't support
FRMR, and some ULPs have attempted to work on them. That's why the
situation is confusing.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 14, 2015, 1:21 p.m. UTC | #72
On 7/14/2015 3:24 PM, Tom Talpey wrote:
> On 7/14/2015 4:06 AM, Sagi Grimberg wrote:
>> All protocols cares about transferring data and sending messages, so
>> it's not a good enough reason for a poor registration method choice.
>> This just emphasizes why we need to converge to a single method.
>
> In my opinion, we already have it.
>
> For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not
> quite equivalent, btw.
>
> For remote registrations, ib_post_send(FRMR).
>
> Unfortunately, there exist ancient adapters in-tree that don't support
> FRMR, and some ULPs have attempted to work on them. That's why the
> situation is confusing.
>

Exactly. This is why I'd prefer not to make an effort to have a unified
API that maintains any form of (confusing) fallback policy.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg July 14, 2015, 1:23 p.m. UTC | #73
On 7/14/2015 3:12 PM, Tom Talpey wrote:
> On 7/14/2015 5:22 AM, Sagi Grimberg wrote:
>> On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote:
>>> On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:
>>>> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:
>>>>> I think what we need to support for now is FRMR as the primary target,
>>>>> and FMR as a secondar[y].
>>>>
>>>> FMR is a *very* bad choice, for several reasons.
>>>>
>>>> 1) It's not supported by very many devices, in fact it might even
>>>> be thought to be obsolete.
>>>
>>> It's support by the Mellanox adapters,
>>
>> *Older* Mellanox adapters. mlx5 (and future drivers I assume) will
>> no longer support FMRs.
>
> Right, but drop the word "will". Mlx5 (the current ConnectX-4) doesn't
> support FMR. It's only supported by mlx4 and mthca drivers.
>
>>> Based on looking at the consumers and the above table I think FMR
>>> are still the better fallback for devices that don't support FR,
>>
>> It's better if you want it fast.
>
> Do you guys think FMR is actually "fast"? Because it's not. Measure it.
> It might have been marginally faster than other schemes in like, 2007,
> and only when the mempool was there to leave the registrations open.
> Don't go back there.

I agree, I meant the pool API which has obvious problems...

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 2:45 p.m. UTC | #74
> 
> I'm willing to add a proper memory_registration.txt under
> Documentation/infiniband/ which would capture the items that
> were brought up here.
>

This would be great.  Documentation/infiniband is very sparse...
 


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 14, 2015, 3:36 p.m. UTC | #75
On Tue, Jul 14, 2015 at 12:10:36PM +0300, Sagi Grimberg wrote:
> Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP
> PoV. If you expose an API that might schedule (PHYS_MR) it limits the
> context that the caller is allowed to call in.
> 
> I'm 100% against an registration API that *might* schedule.

Oh, I had missed that PHYS_MR might sleep.  That might be the reasons
why everyone is avoiding them despite Tom preferring them over FMR.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 14, 2015, 3:40 p.m. UTC | #76
On Tue, Jul 14, 2015 at 12:22:14PM +0300, Sagi Grimberg wrote:
> It's better if you want it fast. I can't stress it enough, but IMO, the
> fallback should *not* be in the API, but rather in the ULP.
> Ideally, at some point it won't need to fall back, and we can remove
> the API.

But if all driver need to impement the same dispatch between
FR/FMR/ALLPHYS and having similar fast path hacks something is wrong.

Maybe we should start but just implementing the helper you suggest in
separae FR/FMR/ALLPHYS handlers while phasing out hacks like use
ALLPHYS for single segment registrations or if other registrations
fail.  And as a second step we can decide if their are similar enough
to consolidate them, or if we should simply kick out FMR and/or ALLPHYS
and thus not support any in-kernel consumers for old HCAs.

n the meantime let's kick off the now aready unuses memwindow in-kernel
API (I fear the userlevel one will have to stay forever) and the
about to be unused by proper in-kernel code PHYS MRs.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 14, 2015, 3:47 p.m. UTC | #77
On 7/14/2015 11:36 AM, 'Christoph Hellwig' wrote:
> On Tue, Jul 14, 2015 at 12:10:36PM +0300, Sagi Grimberg wrote:
>> Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP
>> PoV. If you expose an API that might schedule (PHYS_MR) it limits the
>> context that the caller is allowed to call in.
>>
>> I'm 100% against an registration API that *might* schedule.
>
> Oh, I had missed that PHYS_MR might sleep.  That might be the reasons
> why everyone is avoiding them despite Tom preferring them over FMR.

Any synchronous verb might sleep. They issue commands to the adapter
across the PCIe bus, and need to wait for the result. FMR does, too.
At least the work-request-based ones are explicit about waiting for
the completion.

Are you saying you want an API that does memory registration without
blocking AND without sending a work request to hardware, ever?



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 14, 2015, 4:22 p.m. UTC | #78
On Tue, Jul 14, 2015 at 08:36:19AM -0700, 'Christoph Hellwig' wrote:
> Oh, I had missed that PHYS_MR might sleep.  That might be the reasons
> why everyone is avoiding them despite Tom preferring them over FMR.

Yep, almost certainly.

But even that is just a legacy of the bad API.

Even Sagi's API idea can accommodate this with enough driver effort,
but only if posting is combined:

- alloc_mr: Just set aside any memory the driver needs for PHYS_MR
- set_sg_list_and_post:
  + This would issue the PHYS_MR call to the NIC, but not
    sleep. It would adjust the SQ so that the tail pointer is blocked
    and the NIC doesn't see any more posts. Effectively, the SQ
    stalls. This way the ULP can post more stuff and ordering is preserved.
  + It would then setup a callback for PHYS_MR NIC completion and
    return, having never slept.
  + The callback will unblock the SQ

Yes, this is complicated, but it shows how combing MR setup and post
together lets us do a lot more.

Sadly, we could probably never do this for older drivers due to them
being unmaintained, but it is certainly possible the core could
provide an older driver wrapper that emulates the above with less
efficiency using threads/queues/etc.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 7:25 p.m. UTC | #79
> >>
> >> I'm sorry Steve for leading you down the wrong path with these flags,
> >> I did not fully realize exactly what the iWarp difference was at the
> >> start :(
> >>
> >> Jason
> >
> >
> > No problem.  I'll work on iSER target FRMRs and repost the iSER series.
> >
> > Sagi, right now isert only uses FRMRs along with signature mrs.  I'll need to separate the two, I think.  Does that sound right?
> 
> Yea.
> 
> Given that FRWR takes extra HW (and memory) resources, it
> should probably be:
> 
> if (signature support || iwarp)
> 	use FRMR

Currently the code does:

if (device_supports_fastreg && device_supports_signature)
        use FRMR
else
        use DMAMR

Shouldn't we just recode it this way?

if (device_supports_fastreg)
        use FRMR
else
        use DMAMR

The benefit is that we don't have to check for iWARP protocol in the ULP.  The side effect is, I think, mlx4 will now use FRMR
instead of DMAMR for reads/writes since it doesn't support signature handover.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 14, 2015, 7:29 p.m. UTC | #80
On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote:
 
> The benefit is that we don't have to check for iWARP protocol in the
> ULP.

IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm
pretty sure you have to check for iWarp at somepoint..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 7:32 p.m. UTC | #81
> 
> On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote:
> 
> > The benefit is that we don't have to check for iWARP protocol in the
> > ULP.
> 
> IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm
> pretty sure you have to check for iWarp at somepoint..
> 

You mean "should not", yea?  

Ok.  I'll check for iWARP.  But don't tell me to remove the transport-specific hacks in this series when I post it! ;)



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 14, 2015, 7:37 p.m. UTC | #82
On Tue, Jul 14, 2015 at 02:32:31PM -0500, Steve Wise wrote:

> Ok.  I'll check for iWARP.  But don't tell me to remove the transport-specific hacks in this series when I post it! ;)

Sure thing! Wonder what the test should be? cap_remote_access_lkey? No idea.

Still think this should all be hidden in a wrapper for posting RDMA READ :|

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 14, 2015, 7:45 p.m. UTC | #83
On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote:
> if (device_supports_fastreg && device_supports_signature)
>         use FRMR
> else
>         use DMAMR
> 
> Shouldn't we just recode it this way?
> 
> if (device_supports_fastreg)
>         use FRMR
> else
>         use DMAMR

How does IB_DEVICE_LOCAL_DMA_LKEY (and ->local_dma_lkey) play into
this?  Seems like that should be the preferred option if supported.

Interestingly enough various iWarp driver seem to support this option,
what's the story behind that?  The (to me surprising) conclusion on
the list was that iWarp would always need a memory regireations that
also allows remove writes even for lkeys, but from looking at the
users of IB_DEVICE_LOCAL_DMA_LKEY / local_dma_lkey that seem to
prefer that over creating a MR.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig July 14, 2015, 7:55 p.m. UTC | #84
On Tue, Jul 14, 2015 at 02:32:31PM -0500, Steve Wise wrote:
> You mean "should not", yea?  
> 
> Ok.  I'll check for iWARP.  But don't tell me to remove the transport-specific hacks in this series when I post it! ;)

Just curious if there are any holes in this little scheme to deal with
the lkey mess:

 (1) make sure all drivers that currently do not set
     IB_DEVICE_LOCAL_DMA_LKEY but which can safely use ib_get_dma_mr
     call it underneath at device setup time, and tear it down before
     removal.
 (2) now ULD can check for IB_DEVICE_LOCAL_DMA_LKEY and use local_dma_lkey
     in that case, or will have to perform a per-IO MR with LOCAL and
     REMOTE flags if not
 (3) remove insecure remote uses of ib_get_dma_mr from ULDs
 (4) remove ib_get_dma_mr from the public API

This should help to sort out the lkey side of the memory registrations,
and isn't too hard.  For rkeys I'd love to go with something like Sagis
proposal as a first step, and then we can see if we can refine it in
another iteration.

Given that we might not be able to do the above for the next merge
window add your iWarp transport heck for now, at least we'll have a
clear plan to remove it.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 14, 2015, 7:57 p.m. UTC | #85
On Tue, Jul 14, 2015 at 12:45:12PM -0700, 'Christoph Hellwig' wrote:

> How does IB_DEVICE_LOCAL_DMA_LKEY (and ->local_dma_lkey) play into
> this?  Seems like that should be the preferred option if supported.

Very good question.

> Interestingly enough various iWarp driver seem to support this option,
> what's the story behind that?  The (to me surprising) conclusion on
> the list was that iWarp would always need a memory regireations that

Only for the RDMA READ lkey. All the other verbs are the same as IB,
and should use a local access all physical MR lkey. 

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 7:58 p.m. UTC | #86
> -----Original Message-----
> From: 'Christoph Hellwig' [mailto:hch@infradead.org]
> Sent: Tuesday, July 14, 2015 2:45 PM
> To: Steve Wise
> Cc: 'Sagi Grimberg'; 'Steve Wise'; 'Jason Gunthorpe'; 'Tom Talpey'; 'Doug Ledford'; 'Christoph Hellwig'; sagig@mellanox.com;
> ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-
> nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer'
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote:
> > if (device_supports_fastreg && device_supports_signature)
> >         use FRMR
> > else
> >         use DMAMR
> >
> > Shouldn't we just recode it this way?
> >
> > if (device_supports_fastreg)
> >         use FRMR
> > else
> >         use DMAMR
> 
> How does IB_DEVICE_LOCAL_DMA_LKEY (and ->local_dma_lkey) play into
> this?  Seems like that should be the preferred option if supported.
>

The local_dma_lkey can be used in any rdma sge that requires an lkey.  It is supported for kernel-mode only.  So if you're only ever
going to use the lkey for IO, you really don't need a DMA_MR at all, unless you want to somehow partition up your work load by
protection domain.  But I claim for lkeys, the PD doesn't really protect anything since the remote peers can't use it anyway.  

> Interestingly enough various iWarp driver seem to support this option,
> what's the story behind that?  The (to me surprising) conclusion on
> the list was that iWarp would always need a memory regireations that
> also allows remove writes even for lkeys, but from looking at the
> users of IB_DEVICE_LOCAL_DMA_LKEY / local_dma_lkey that seem to
> prefer that over creating a MR.

There is confusion about lkeys and rkeys with regard to iWARP.  In the iWARP verbs, there is no distinction between an lkey and
rkey: they are the same key, called a Steering Tag or STAG.  When you create a MR, the lkey == rkey == STAG for iwarp transports.
Somewhat related, but really a different issue, is that SGEs that are the target of a read need REMOTE_WRITE access flags on their
STAG for iWARP. 

Clear as mud? :)


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 8:10 p.m. UTC | #87
> -----Original Message-----
> From: 'Christoph Hellwig' [mailto:hch@infradead.org]
> Sent: Tuesday, July 14, 2015 2:55 PM
> To: Steve Wise
> Cc: 'Jason Gunthorpe'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; 'Christoph Hellwig'; sagig@mellanox.com;
> ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-
> nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer'
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On Tue, Jul 14, 2015 at 02:32:31PM -0500, Steve Wise wrote:
> > You mean "should not", yea?
> >
> > Ok.  I'll check for iWARP.  But don't tell me to remove the transport-specific hacks in this series when I post it! ;)
> 
> Just curious if there are any holes in this little scheme to deal with
> the lkey mess:
> 
>  (1) make sure all drivers that currently do not set
>      IB_DEVICE_LOCAL_DMA_LKEY but which can safely use ib_get_dma_mr
>      call it underneath at device setup time, and tear it down before
>      removal.
>  (2) now ULD can check for IB_DEVICE_LOCAL_DMA_LKEY and use local_dma_lkey
>      in that case, or will have to perform a per-IO MR with LOCAL and
>      REMOTE flags if not
>  (3) remove insecure remote uses of ib_get_dma_mr from ULDs
>  (4) remove ib_get_dma_mr from the public API
> 

Perhaps I missed some of the discussion on all this, but what is the point of #1?

Are these 4 steps intended to be (bisectable) steps / commits with the goal of removing ib_get_dma_mr()?  If so I still don't get
#1.

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 14, 2015, 8:41 p.m. UTC | #88
On Tue, Jul 14, 2015 at 02:58:23PM -0500, Steve Wise wrote:
> The local_dma_lkey can be used in any rdma sge that requires an
> lkey.

No, this is where iWarp doesn't follow the generic API - a local dma
lkey cannot be used with iWarp's RDMA_READ WR lkey. In effect RDMA
READ requires an *rkey* (confusingly stuck into the lkey slot) for
iWarp. (Right?)

*THAT* is really the core difference between IB and iWarp in this
area, not that the access flags are different.

(cap_rmda_read_lkey_is_rkey ?)

> domain.  But I claim for lkeys, the PD doesn't really protect
> anything since the remote peers can't use it anyway.

I agree.

To use a PD properly I'd have thought it should be created on a client
by client basis. The risk is tiny, but client X should not be able to
guess Y's RKey and then corrupt a data transfer. *Especially* on a
server if client X hasn't auth'd yet .... That is what the PD is for.

> There is confusion about lkeys and rkeys with regard to iWARP.  In
> the iWARP verbs, there is no distinction between an lkey and rkey:
> they are the same key, called a Steering Tag or STAG.  When you
> create a MR, the lkey == rkey == STAG for iwarp transports.
> Somewhat related, but really a different issue, is that SGEs that
> are the target of a read need REMOTE_WRITE access flags on their
> STAG for iWARP.

That is the clearest explanation for the iWarp difference I've seen so
far, thanks!

Christoph: The take away from all this is that on iWarp RDMA_READ
requires a restricted temporary MR to provide the lkey value in the
WC. It cannot use local_dma_lkey.

Everything else is the same between IB and iWarp: local_dma_lkey can
be used as the lkey for SEND, RECV, WRITE.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Talpey July 14, 2015, 8:46 p.m. UTC | #89
On 7/14/2015 3:32 PM, Steve Wise wrote:
>>
>> On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote:
>>
>>> The benefit is that we don't have to check for iWARP protocol in the
>>> ULP.
>>
>> IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm
>> pretty sure you have to check for iWarp at somepoint..
>>
>
> You mean "should not", yea?
>
> Ok.  I'll check for iWARP.  But don't tell me to remove the transport-specific hacks in this series when I post it! ;)
>

FYI, in the Windows NDKPI (verbs-like kernel provider interface), there
is a device attribute defined as follows:

<https://msdn.microsoft.com/en-us/library/windows/hardware/hh439851(v=vs.85).aspx>

NDK_ADAPTER_FLAG_RDMA_READ_SINK_NOT_REQUIRED 0x00000002

Set if the provider does not require special access rights on the sink
buffer for an RDMA read request. When this flag is set, the consumer is
not required to use the NDK_MR_FLAG_RDMA_READ_SINK or
NDK_OP_FLAG_RDMA_READ_SINK flags when it registers sink buffers for
RDMA read requests. The consumer can also use logical address mappings
directly (with a token obtained with the
NDK_FN_GET_PRIVILEGED_MEMORY_REGION_TOKEN function) as RDMA read sink
buffers. This is similar to access to local buffers for RDMA write,
send, and receive operations.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 8:51 p.m. UTC | #90
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Jason Gunthorpe
> Sent: Tuesday, July 14, 2015 3:42 PM
> To: Steve Wise
> Cc: 'Christoph Hellwig'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com;
> roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org;
> trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer'
> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> On Tue, Jul 14, 2015 at 02:58:23PM -0500, Steve Wise wrote:
> > The local_dma_lkey can be used in any rdma sge that requires an
> > lkey.
> 
> No, this is where iWarp doesn't follow the generic API - a local dma
> lkey cannot be used with iWarp's RDMA_READ WR lkey. In effect RDMA
> READ requires an *rkey* (confusingly stuck into the lkey slot) for
> iWarp. (Right?)
>

Right, a local_dma_lkey is not an rkey, and iwarp requires the rkey for the read destination MR.  Further that rkey needs
REMOTE_WRITE.

 
> *THAT* is really the core difference between IB and iWarp in this
> area, not that the access flags are different.
>

It both.  Because an rkey without REMOTE_WRITE would not work.
 
> (cap_rmda_read_lkey_is_rkey ?)
> 

IMO its more like rdma_cap_read_dest_needs_remote_write().  And maybe make it camel step too. ;)

> > domain.  But I claim for lkeys, the PD doesn't really protect
> > anything since the remote peers can't use it anyway.
> 
> I agree.
> 
> To use a PD properly I'd have thought it should be created on a client
> by client basis. The risk is tiny, but client X should not be able to
> guess Y's RKey and then corrupt a data transfer. *Especially* on a
> server if client X hasn't auth'd yet .... That is what the PD is for.
> 
> > There is confusion about lkeys and rkeys with regard to iWARP.  In
> > the iWARP verbs, there is no distinction between an lkey and rkey:
> > they are the same key, called a Steering Tag or STAG.  When you
> > create a MR, the lkey == rkey == STAG for iwarp transports.
> > Somewhat related, but really a different issue, is that SGEs that
> > are the target of a read need REMOTE_WRITE access flags on their
> > STAG for iWARP.
> 
> That is the clearest explanation for the iWarp difference I've seen so
> far, thanks!
> 
> Christoph: The take away from all this is that on iWarp RDMA_READ
> requires a restricted temporary MR to provide the lkey value in the
> WC. It cannot use local_dma_lkey.
>
> Everything else is the same between IB and iWarp: local_dma_lkey can
> be used as the lkey for SEND, RECV, WRITE.
>

The source of a WRITE can use local_dma_lkey, not the destination.   But this is true for IB and IW...



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 14, 2015, 9:01 p.m. UTC | #91
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Steve Wise
> Sent: Tuesday, July 14, 2015 3:51 PM
> To: 'Jason Gunthorpe'
> Cc: 'Christoph Hellwig'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com;
> roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org;
> trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer'
> Subject: RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> 
> 
> > -----Original Message-----
> > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Jason Gunthorpe
> > Sent: Tuesday, July 14, 2015 3:42 PM
> > To: Steve Wise
> > Cc: 'Christoph Hellwig'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com;
> > roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org;
> > trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer'
> > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> >
> > On Tue, Jul 14, 2015 at 02:58:23PM -0500, Steve Wise wrote:
> > > The local_dma_lkey can be used in any rdma sge that requires an
> > > lkey.
> >
> > No, this is where iWarp doesn't follow the generic API - a local dma
> > lkey cannot be used with iWarp's RDMA_READ WR lkey. In effect RDMA
> > READ requires an *rkey* (confusingly stuck into the lkey slot) for
> > iWarp. (Right?)
> >
> 
> Right, a local_dma_lkey is not an rkey, and iwarp requires the rkey for the read destination MR.  Further that rkey needs
> REMOTE_WRITE.
>

BTW: What use is an IB rkey with no REMOTE_ flags set?  Can it be used somehow differently than the associated lkey?


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 14, 2015, 9:14 p.m. UTC | #92
On Tue, Jul 14, 2015 at 04:01:04PM -0500, Steve Wise wrote:

> > Right, a local_dma_lkey is not an rkey, and iwarp requires the
> > rkey for the read destination MR.  Further that rkey needs
> > REMOTE_WRITE.
> 
> BTW: What use is an IB rkey with no REMOTE_ flags set?  Can it be
> used somehow differently than the associated lkey?

Don't think so. Sagi?

This looks like it is just an artifact of the sloppy API that treats a
rkey MR and lkey MR as the same thing - they are clearly not, and we
should start talking about APIs that return lkeys or rkeys, never both
(and enforcing that lkey and rkey MRS have the right ACCESS flags).

Having looked at this for a bit now, I am of the view that it is very
hard to use a MR as both rkey and lkey without creating some kind of
security problem. At least every place in current ULPs that does this
is a security problem :) So the API should prevent it, IMHO.

local_dma_lkey is an excellent step, and if Sagi's MR unification
patch is careful to have a lkey/rkey API entry point for the two
usages we can maybe nuke this problem for good...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean July 23, 2015, 12:43 a.m. UTC | #93
> > This just emphasizes why we need to converge to a single method.
> 
> In my opinion, we already have it.
> 
> For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not
> quite equivalent, btw.

Personally, I would work to eliminate local registration as part of the API.

> For remote registrations, ib_post_send(FRMR).

I slightly agree here.  IMO, the rkey should be obtained by associating the memory buffer with the QP, but not explicitly as a 'send' operation.

If queuing is a concern, we could expose max_active_registrations/max_total_registrations QP attributes.

I would hide the protection domain concept entirely.  If needed, a provider can allocate one PD per QP, with memory registration going to the PD.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean July 23, 2015, 6:53 p.m. UTC | #94
> There is confusion about lkeys and rkeys with regard to iWARP.  In the
> iWARP verbs, there is no distinction between an lkey and
> rkey: they are the same key, called a Steering Tag or STAG.  When you
> create a MR, the lkey == rkey == STAG for iwarp transports.
> Somewhat related, but really a different issue, is that SGEs that are the
> target of a read need REMOTE_WRITE access flags on their
> STAG for iWARP.
> 
> Clear as mud? :)

This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters.  They convey different ways of finding a data buffer.  For example, do you locate a buffer using the stag, then verify that the offset + length fits into the target buffer?  Or do you locate the buffer by address, then verify that the key matches?

Consider if we allow an app to specify the rkey/stag, or reference the buffer using an offset, rather than a virtual address.

This seems to be part of the difference between an lkey and an rkey. 
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Wise July 23, 2015, 7:03 p.m. UTC | #95
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty@intel.com]
> Sent: Thursday, July 23, 2015 1:53 PM
> To: Steve Wise; 'Christoph Hellwig'
> Cc: 'Sagi Grimberg'; 'Steve Wise'; 'Jason Gunthorpe'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com;
> roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org;
> trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer'
> Subject: RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
> 
> > There is confusion about lkeys and rkeys with regard to iWARP.  In the
> > iWARP verbs, there is no distinction between an lkey and
> > rkey: they are the same key, called a Steering Tag or STAG.  When you
> > create a MR, the lkey == rkey == STAG for iwarp transports.
> > Somewhat related, but really a different issue, is that SGEs that are the
> > target of a read need REMOTE_WRITE access flags on their
> > STAG for iWARP.
> >
> > Clear as mud? :)
> 
> This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters.  They convey different ways of finding a data
buffer.  For
> example, do you locate a buffer using the stag, then verify that the offset + length fits into the target buffer?  

Yes.  HW always uses the stag to locate a record that contains the stag state (valid or invalid), the access flags, the 8b key, the
va_base, length, PBL describing the host pages, etc.  HW validates all that before using the buffer.  NOTE: An stag of 0 is the
special local-dma-lkey which HW treats differently: If the stag is 0, then the address in the SGE is the bus/dma address itself and
no lookup of a MR/PBL/etc is needed.   Stag 0 can ONLY be used by kernel users and MUST never be accepted/used from an ingress
packet and MUST never be emitted on the wire in a READ or WRITE.

> Or do you locate the buffer
> by address, then verify that the key matches?
> 

This is never done.

> Consider if we allow an app to specify the rkey/stag, or reference the buffer using an offset, rather than a virtual address.
> 
> This seems to be part of the difference between an lkey and an rkey.



--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean July 23, 2015, 11:30 p.m. UTC | #96
> > This may be a nit, but IMO, the use of the term 'rkey' versus 'stag'
> matters.  They convey different ways of finding a data
> buffer.  For
> > example, do you locate a buffer using the stag, then verify that the
> offset + length fits into the target buffer?
> 
> Yes.  HW always uses the stag to locate a record that contains the stag
> state (valid or invalid), the access flags, the 8b key, the
> va_base, length, PBL describing the host pages, etc.  HW validates all
> that before using the buffer.  NOTE: An stag of 0 is the
> special local-dma-lkey which HW treats differently: If the stag is 0, then
> the address in the SGE is the bus/dma address itself and
> no lookup of a MR/PBL/etc is needed.   Stag 0 can ONLY be used by kernel
> users and MUST never be accepted/used from an ingress
> packet and MUST never be emitted on the wire in a READ or WRITE.
> 
> > Or do you locate the buffer
> > by address, then verify that the key matches?
> >
> 
> This is never done.

These were more rhetorical questions to highlight that stag is a better choice for the name than rkey.  Lkey seems better than stag, though I'd rather see lkey removed completely.  I don't see where usnic, ipath, qib, or opa technically need an lkey at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 23, 2015, 11:53 p.m. UTC | #97
On Thu, Jul 23, 2015 at 11:30:08PM +0000, Hefty, Sean wrote:

> I don't see where usnic, ipath, qib, or opa technically need an lkey
> at all.

The lkey is possibly useful if someone wants to do single op transfers
larger than the S/G limit of the SQE. I haven't noticed any ULPs doing
that..

qib family may not technically need a lkey, but those drivers are the
only modern drivers not to support IB_DEVICE_LOCAL_DMA_LKEY.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean July 24, 2015, 12:18 a.m. UTC | #98
> The lkey is possibly useful if someone wants to do single op transfers
> larger than the S/G limit of the SQE. I haven't noticed any ULPs doing
> that..

That changes how the buffer is identified, which gets back to my question of are we identifying local buffers by address or through some sort of iova/tag/descriptor/mr/whatever.  Do we have this right in the API?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 24, 2015, 4:46 a.m. UTC | #99
On Fri, Jul 24, 2015 at 12:18:41AM +0000, Hefty, Sean wrote:
> > The lkey is possibly useful if someone wants to do single op transfers
> > larger than the S/G limit of the SQE. I haven't noticed any ULPs doing
> > that..
> 
> That changes how the buffer is identified, which gets back to my
> question of are we identifying local buffers by address or through
> some sort of iova/tag/descriptor/mr/whatever.  Do we have this right
> in the API?

After Sagi's work, and my patchset, all ULPs will talk about local
buffers in only two ways:
 - struct ib_sge w/ local_dma_lkey
 - struct scatterlist with memory registration to a simple ib_sge.
   (only done for iWarp RDMA READ? Maybe iSER DIX as well?)

Efforts to unify them have not been successful for fairly reasonable
reasons :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index bac3fb4..f42595c 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1144,6 +1144,36 @@  struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
 }
 EXPORT_SYMBOL(ib_get_dma_mr);
 
+int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs)
+{
+	int access_flags = attrs;
+
+	if (roles & RDMA_MRR_RECV)
+		access_flags |= IB_ACCESS_LOCAL_WRITE;
+
+	if (roles & RDMA_MRR_WRITE_DEST)
+		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
+
+	if (roles & RDMA_MRR_READ_DEST) {
+		access_flags |= IB_ACCESS_LOCAL_WRITE;
+		if (rdma_protocol_iwarp(pd->device,
+					rdma_start_port(pd->device)))
+			access_flags |= IB_ACCESS_REMOTE_WRITE;
+	}
+
+	if (roles & RDMA_MRR_READ_SOURCE)
+		access_flags |= IB_ACCESS_REMOTE_READ;
+
+	if (roles & RDMA_MRR_ATOMIC_DEST)
+		access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC;
+
+	if (roles & RDMA_MRR_MW_BIND)
+		access_flags |= IB_ACCESS_MW_BIND;
+
+	return access_flags;
+}
+EXPORT_SYMBOL(rdma_device_access_flags);
+
 struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
 			     struct ib_phys_buf *phys_buf_array,
 			     int num_phys_buf,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 986fddb..da1eadf 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2494,6 +2494,112 @@  static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt)
 struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
 
 /**
+ * rdma_mr_roles - possible roles an RDMA MR will be used for
+ *
+ * This allows a transport independent RDMA application to
+ * create MRs that are usable for all the desired roles w/o
+ * having to understand which access rights are needed.
+ */
+enum {
+
+	/* lkey used in a ib_recv_wr sge */
+	RDMA_MRR_RECV			= 1,
+
+	/* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */
+	RDMA_MRR_SEND			= (1<<1),
+
+	/* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */
+	RDMA_MRR_READ_SOURCE		= (1<<2),
+
+	/* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */
+	RDMA_MRR_READ_DEST		= (1<<3),
+
+	/* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */
+	RDMA_MRR_WRITE_SOURCE		= (1<<4),
+
+	/* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */
+	RDMA_MRR_WRITE_DEST		= (1<<5),
+
+	/*
+	 * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge
+	 */
+	RDMA_MRR_ATOMIC_SOURCE		= (1<<6),
+
+	/*
+	 * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr
+	 * wr.atomic.rkey
+	 */
+	RDMA_MRR_ATOMIC_DEST		= (1<<7),
+
+	/* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */
+	RDMA_MRR_MW_BIND		= (1<<8),
+};
+
+/**
+ * rdma_mr_attributes - attributes for rdma memory regions
+ */
+enum {
+	RDMA_MRA_ZERO_BASED		= IB_ZERO_BASED,
+	RDMA_MRA_ACCESS_ON_DEMAND	= IB_ACCESS_ON_DEMAND,
+};
+
+/**
+ * rdma_device_access_flags - Returns the device-specific MR access flags.
+ * @pd: The protection domain associated with the memory region.
+ * @roles: The intended roles of the MR
+ * @attrs: The desired attributes of the MR
+ *
+ * Use the intended roles from @roles along with @attrs and the device
+ * capabilities to generate the needed access rights.
+ *
+ * Return: the ib_access_flags value needed to allocate the MR.
+ */
+int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs);
+
+/**
+ * rdma_get_dma_mr - Returns a memory region for system memory that is
+ * usable for DMA.
+ * @pd: The protection domain associated with the memory region.
+ * @roles: The intended roles of the MR
+ * @attrs: The desired attributes of the MR
+ *
+ * Use the intended roles from @roles along with @attrs and the device
+ * capabilities to define the needed access rights, and call
+ * ib_get_dma_mr() to allocate the MR.
+ *
+ * Note that the ib_dma_*() functions defined below must be used
+ * to create/destroy addresses used with the Lkey or Rkey returned
+ * by ib_get_dma_mr().
+ *
+ * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon
+ * failure.
+ */
+static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles,
+					    int attrs)
+{
+	return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs));
+}
+
+/**
+ * rdma_fast_reg_access_flags - Returns the access flags needed for a fast
+ * register operation.
+ * @pd: The protection domain associated with the memory region.
+ * @roles: The intended roles of the MR
+ * @attrs: The desired attributes of the MR
+ *
+ * Use the intended roles from @roles along with @attrs and the device
+ * capabilities to define the needed access rights for a fast registration
+ * work request.
+ *
+ * Return: the ib_access_flags value needed for fast registration.
+ */
+static inline int rdma_fast_reg_access_flags(struct ib_pd *pd, int roles,
+					     int attrs)
+{
+	return rdma_device_access_flags(pd, roles, attrs);
+}
+
+/**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
  * @dma_addr: The DMA address to check