Message ID | 20150705232158.12029.25472.stgit@build2.ogc.int (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Sun, Jul 05, 2015 at 06:22:00PM -0500, Steve Wise wrote: > The semantics for MR access flags are not consistent across RDMA > protocols. So rather than have applications try and glean what they > need, have them pass in the intended roles and attributes for the MR to > be allocated and let the RDMA core select the appropriate access flags > given the roles, attributes, and device capabilities. > > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > possible roles and attributes for a MR. These are documented in the > enums themselves. > > New services exported: > > rdma_device_access_flags() - given the intended MR roles and attributes > passed in, return the ib_access_flags bits for the device. > > rdma_get_dma_mr() - allocate a dma mr using the applications intended > MR roles and MR attributes. This uses rdma_device_access_flags(). > > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > for a fast register WR given the applications intended MR roles and > MR attributes. This uses rdma_device_access_flags(). > > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > --- > > drivers/infiniband/core/verbs.c | 30 +++++++++++ > include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 136 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > index bac3fb4..f42595c 100644 > --- a/drivers/infiniband/core/verbs.c > +++ b/drivers/infiniband/core/verbs.c > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > } > EXPORT_SYMBOL(ib_get_dma_mr); > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > +{ > + int access_flags = attrs; Please add an assert for the values that are allowed for attrs. It also would be highly useful to add a kerneldoc comment describing the function and the parameters. Also __bitwise sparse tricks to ensure the right flags are passed wouldn't be a too bad idea. > +/** > + * rdma_mr_attributes - attributes for rdma memory regions > + */ > +enum { > + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > +}; Why not specfiy these as separate nuerical namespace? > +/** > + * rdma_device_access_flags - Returns the device-specific MR access flags. > + * @pd: The protection domain associated with the memory region. > + * @roles: The intended roles of the MR > + * @attrs: The desired attributes of the MR > + * > + * Use the intended roles from @roles along with @attrs and the device > + * capabilities to generate the needed access rights. > + * > + * Return: the ib_access_flags value needed to allocate the MR. > + */ > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); Oh, here we have kerneldoc comments, just in the wrong place. Please move them to the function defintion. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/07/2015 02:22, Steve Wise wrote: > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > +{ > + int access_flags = attrs; > + > + if (roles & RDMA_MRR_RECV) > + access_flags |= IB_ACCESS_LOCAL_WRITE; > + > + if (roles & RDMA_MRR_WRITE_DEST) > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > + > + if (roles & RDMA_MRR_READ_DEST) { > + access_flags |= IB_ACCESS_LOCAL_WRITE; > + if (rdma_protocol_iwarp(pd->device, > + rdma_start_port(pd->device))) > + access_flags |= IB_ACCESS_REMOTE_WRITE; > + } > + > + if (roles & RDMA_MRR_READ_SOURCE) > + access_flags |= IB_ACCESS_REMOTE_READ; > + > + if (roles & RDMA_MRR_ATOMIC_DEST) > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the results of the atomic operation. > + > + if (roles & RDMA_MRR_MW_BIND) > + access_flags |= IB_ACCESS_MW_BIND; > + > + return access_flags; > +} > +EXPORT_SYMBOL(rdma_device_access_flags); -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/6/2015 2:22 AM, Steve Wise wrote: > The semantics for MR access flags are not consistent across RDMA > protocols. So rather than have applications try and glean what they > need, have them pass in the intended roles and attributes for the MR to > be allocated and let the RDMA core select the appropriate access flags > given the roles, attributes, and device capabilities. > > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > possible roles and attributes for a MR. These are documented in the > enums themselves. > > New services exported: > > rdma_device_access_flags() - given the intended MR roles and attributes > passed in, return the ib_access_flags bits for the device. > > rdma_get_dma_mr() - allocate a dma mr using the applications intended > MR roles and MR attributes. This uses rdma_device_access_flags(). > > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > for a fast register WR given the applications intended MR roles and > MR attributes. This uses rdma_device_access_flags(). > > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > --- > > drivers/infiniband/core/verbs.c | 30 +++++++++++ > include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 136 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > index bac3fb4..f42595c 100644 > --- a/drivers/infiniband/core/verbs.c > +++ b/drivers/infiniband/core/verbs.c > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > } > EXPORT_SYMBOL(ib_get_dma_mr); > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > +{ > + int access_flags = attrs; > + > + if (roles & RDMA_MRR_RECV) > + access_flags |= IB_ACCESS_LOCAL_WRITE; > + > + if (roles & RDMA_MRR_WRITE_DEST) > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > + > + if (roles & RDMA_MRR_READ_DEST) { > + access_flags |= IB_ACCESS_LOCAL_WRITE; > + if (rdma_protocol_iwarp(pd->device, > + rdma_start_port(pd->device))) > + access_flags |= IB_ACCESS_REMOTE_WRITE; > + } > + > + if (roles & RDMA_MRR_READ_SOURCE) > + access_flags |= IB_ACCESS_REMOTE_READ; > + > + if (roles & RDMA_MRR_ATOMIC_DEST) > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > + > + if (roles & RDMA_MRR_MW_BIND) > + access_flags |= IB_ACCESS_MW_BIND; > + > + return access_flags; > +} > +EXPORT_SYMBOL(rdma_device_access_flags); > + > struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, > struct ib_phys_buf *phys_buf_array, > int num_phys_buf, > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 986fddb..da1eadf 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) > struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); > > /** > + * rdma_mr_roles - possible roles an RDMA MR will be used for > + * > + * This allows a transport independent RDMA application to > + * create MRs that are usable for all the desired roles w/o > + * having to understand which access rights are needed. > + */ > +enum { > + > + /* lkey used in a ib_recv_wr sge */ > + RDMA_MRR_RECV = 1, > + > + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ > + RDMA_MRR_SEND = (1<<1), > + > + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ > + RDMA_MRR_READ_SOURCE = (1<<2), > + > + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ > + RDMA_MRR_READ_DEST = (1<<3), > + > + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ > + RDMA_MRR_WRITE_SOURCE = (1<<4), > + > + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ > + RDMA_MRR_WRITE_DEST = (1<<5), > + > + /* > + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge > + */ > + RDMA_MRR_ATOMIC_SOURCE = (1<<6), > + > + /* > + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr > + * wr.atomic.rkey > + */ > + RDMA_MRR_ATOMIC_DEST = (1<<7), > + > + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ > + RDMA_MRR_MW_BIND = (1<<8), > +}; > + > +/** > + * rdma_mr_attributes - attributes for rdma memory regions > + */ > +enum { > + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > +}; > + > +/** > + * rdma_device_access_flags - Returns the device-specific MR access flags. > + * @pd: The protection domain associated with the memory region. > + * @roles: The intended roles of the MR > + * @attrs: The desired attributes of the MR > + * > + * Use the intended roles from @roles along with @attrs and the device > + * capabilities to generate the needed access rights. > + * > + * Return: the ib_access_flags value needed to allocate the MR. > + */ > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); > + > +/** > + * rdma_get_dma_mr - Returns a memory region for system memory that is > + * usable for DMA. > + * @pd: The protection domain associated with the memory region. > + * @roles: The intended roles of the MR > + * @attrs: The desired attributes of the MR > + * > + * Use the intended roles from @roles along with @attrs and the device > + * capabilities to define the needed access rights, and call > + * ib_get_dma_mr() to allocate the MR. > + * > + * Note that the ib_dma_*() functions defined below must be used > + * to create/destroy addresses used with the Lkey or Rkey returned > + * by ib_get_dma_mr(). > + * > + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon > + * failure. > + */ > +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, > + int attrs) > +{ > + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); > +} > + I still think this wrapper should go away... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/6/2015 2:22 AM, Steve Wise wrote: > The semantics for MR access flags are not consistent across RDMA > protocols. So rather than have applications try and glean what they > need, have them pass in the intended roles and attributes for the MR to > be allocated and let the RDMA core select the appropriate access flags > given the roles, attributes, and device capabilities. > > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > possible roles and attributes for a MR. These are documented in the > enums themselves. > > New services exported: > > rdma_device_access_flags() - given the intended MR roles and attributes > passed in, return the ib_access_flags bits for the device. > > rdma_get_dma_mr() - allocate a dma mr using the applications intended > MR roles and MR attributes. This uses rdma_device_access_flags(). > > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > for a fast register WR given the applications intended MR roles and > MR attributes. This uses rdma_device_access_flags(). > > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > --- > > drivers/infiniband/core/verbs.c | 30 +++++++++++ > include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 136 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > index bac3fb4..f42595c 100644 > --- a/drivers/infiniband/core/verbs.c > +++ b/drivers/infiniband/core/verbs.c > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > } > EXPORT_SYMBOL(ib_get_dma_mr); > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > +{ > + int access_flags = attrs; > + > + if (roles & RDMA_MRR_RECV) > + access_flags |= IB_ACCESS_LOCAL_WRITE; > + > + if (roles & RDMA_MRR_WRITE_DEST) > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > + > + if (roles & RDMA_MRR_READ_DEST) { > + access_flags |= IB_ACCESS_LOCAL_WRITE; > + if (rdma_protocol_iwarp(pd->device, > + rdma_start_port(pd->device))) > + access_flags |= IB_ACCESS_REMOTE_WRITE; > + } > + > + if (roles & RDMA_MRR_READ_SOURCE) > + access_flags |= IB_ACCESS_REMOTE_READ; > + > + if (roles & RDMA_MRR_ATOMIC_DEST) > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > + > + if (roles & RDMA_MRR_MW_BIND) > + access_flags |= IB_ACCESS_MW_BIND; > + > + return access_flags; > +} > +EXPORT_SYMBOL(rdma_device_access_flags); > + > struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, > struct ib_phys_buf *phys_buf_array, > int num_phys_buf, > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 986fddb..da1eadf 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) > struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); > > /** > + * rdma_mr_roles - possible roles an RDMA MR will be used for > + * > + * This allows a transport independent RDMA application to > + * create MRs that are usable for all the desired roles w/o > + * having to understand which access rights are needed. > + */ > +enum { > + > + /* lkey used in a ib_recv_wr sge */ > + RDMA_MRR_RECV = 1, > + > + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ > + RDMA_MRR_SEND = (1<<1), > + > + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ > + RDMA_MRR_READ_SOURCE = (1<<2), > + > + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ > + RDMA_MRR_READ_DEST = (1<<3), > + > + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ > + RDMA_MRR_WRITE_SOURCE = (1<<4), > + > + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ > + RDMA_MRR_WRITE_DEST = (1<<5), > + > + /* > + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge > + */ > + RDMA_MRR_ATOMIC_SOURCE = (1<<6), > + > + /* > + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr > + * wr.atomic.rkey > + */ > + RDMA_MRR_ATOMIC_DEST = (1<<7), > + > + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ > + RDMA_MRR_MW_BIND = (1<<8), > +}; > + > +/** > + * rdma_mr_attributes - attributes for rdma memory regions > + */ > +enum { > + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > +}; > + > +/** > + * rdma_device_access_flags - Returns the device-specific MR access flags. > + * @pd: The protection domain associated with the memory region. > + * @roles: The intended roles of the MR > + * @attrs: The desired attributes of the MR > + * > + * Use the intended roles from @roles along with @attrs and the device > + * capabilities to generate the needed access rights. > + * > + * Return: the ib_access_flags value needed to allocate the MR. > + */ > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); > + > +/** > + * rdma_get_dma_mr - Returns a memory region for system memory that is > + * usable for DMA. > + * @pd: The protection domain associated with the memory region. > + * @roles: The intended roles of the MR > + * @attrs: The desired attributes of the MR > + * > + * Use the intended roles from @roles along with @attrs and the device > + * capabilities to define the needed access rights, and call > + * ib_get_dma_mr() to allocate the MR. > + * > + * Note that the ib_dma_*() functions defined below must be used > + * to create/destroy addresses used with the Lkey or Rkey returned > + * by ib_get_dma_mr(). > + * > + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon > + * failure. > + */ > +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, > + int attrs) > +{ > + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); > +} > + > +/** > + * rdma_fast_reg_access_flags - Returns the access flags needed for a fast > + * register operation. > + * @pd: The protection domain associated with the memory region. > + * @roles: The intended roles of the MR > + * @attrs: The desired attributes of the MR > + * > + * Use the intended roles from @roles along with @attrs and the device > + * capabilities to define the needed access rights for a fast registration > + * work request. > + * > + * Return: the ib_access_flags value needed for fast registration. > + */ > +static inline int rdma_fast_reg_access_flags(struct ib_pd *pd, int roles, > + int attrs) > +{ > + return rdma_device_access_flags(pd, roles, attrs); > +} Why do we have rdma_fast_reg_access_flags() that just trampolines to rdma_device_access_flags()? why do we need it? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Christoph Hellwig [mailto:hch@infradead.org] > Sent: Monday, July 06, 2015 12:25 AM > To: Steve Wise > Cc: dledford@redhat.com; sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; > eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On Sun, Jul 05, 2015 at 06:22:00PM -0500, Steve Wise wrote: > > The semantics for MR access flags are not consistent across RDMA > > protocols. So rather than have applications try and glean what they > > need, have them pass in the intended roles and attributes for the MR to > > be allocated and let the RDMA core select the appropriate access flags > > given the roles, attributes, and device capabilities. > > > > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > > possible roles and attributes for a MR. These are documented in the > > enums themselves. > > > > New services exported: > > > > rdma_device_access_flags() - given the intended MR roles and attributes > > passed in, return the ib_access_flags bits for the device. > > > > rdma_get_dma_mr() - allocate a dma mr using the applications intended > > MR roles and MR attributes. This uses rdma_device_access_flags(). > > > > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > > for a fast register WR given the applications intended MR roles and > > MR attributes. This uses rdma_device_access_flags(). > > > > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > > --- > > > > drivers/infiniband/core/verbs.c | 30 +++++++++++ > > include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 136 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > > index bac3fb4..f42595c 100644 > > --- a/drivers/infiniband/core/verbs.c > > +++ b/drivers/infiniband/core/verbs.c > > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > > } > > EXPORT_SYMBOL(ib_get_dma_mr); > > > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > > +{ > > + int access_flags = attrs; > > Please add an assert for the values that are allowed for attrs. > > It also would be highly useful to add a kerneldoc comment describing > the function and the parameters. Also __bitwise sparse tricks > to ensure the right flags are passed wouldn't be a too bad idea. > Can you explain the "__bitwise sparse tricks"? Or point me to an example. > > +/** > > + * rdma_mr_attributes - attributes for rdma memory regions > > + */ > > +enum { > > + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > > + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > > +}; > > Why not specfiy these as separate nuerical namespace? > To avoid having to translate them. > > +/** > > + * rdma_device_access_flags - Returns the device-specific MR access flags. > > + * @pd: The protection domain associated with the memory region. > > + * @roles: The intended roles of the MR > > + * @attrs: The desired attributes of the MR > > + * > > + * Use the intended roles from @roles along with @attrs and the device > > + * capabilities to generate the needed access rights. > > + * > > + * Return: the ib_access_flags value needed to allocate the MR. > > + */ > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); > > Oh, here we have kerneldoc comments, just in the wrong place. Please > move them to the function defintion. Ok. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Haggai Eran [mailto:haggaie@mellanox.com] > Sent: Monday, July 06, 2015 2:09 AM > To: Steve Wise; dledford@redhat.com > Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 06/07/2015 02:22, Steve Wise wrote: > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > > +{ > > + int access_flags = attrs; > > + > > + if (roles & RDMA_MRR_RECV) > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + > > + if (roles & RDMA_MRR_WRITE_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > > + > > + if (roles & RDMA_MRR_READ_DEST) { > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + if (rdma_protocol_iwarp(pd->device, > > + rdma_start_port(pd->device))) > > + access_flags |= IB_ACCESS_REMOTE_WRITE; > > + } > > + > > + if (roles & RDMA_MRR_READ_SOURCE) > > + access_flags |= IB_ACCESS_REMOTE_READ; > > + > > + if (roles & RDMA_MRR_ATOMIC_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > > I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the > results of the atomic operation. > Ok. I'm not familiar with atomics. I'll verify with the IB spec and update the code as needed. > > + > > + if (roles & RDMA_MRR_MW_BIND) > > + access_flags |= IB_ACCESS_MW_BIND; > > + > > + return access_flags; > > +} > > +EXPORT_SYMBOL(rdma_device_access_flags); -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il] > Sent: Monday, July 06, 2015 2:54 AM > To: Steve Wise; dledford@redhat.com > Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 7/6/2015 2:22 AM, Steve Wise wrote: > > The semantics for MR access flags are not consistent across RDMA > > protocols. So rather than have applications try and glean what they > > need, have them pass in the intended roles and attributes for the MR to > > be allocated and let the RDMA core select the appropriate access flags > > given the roles, attributes, and device capabilities. > > > > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > > possible roles and attributes for a MR. These are documented in the > > enums themselves. > > > > New services exported: > > > > rdma_device_access_flags() - given the intended MR roles and attributes > > passed in, return the ib_access_flags bits for the device. > > > > rdma_get_dma_mr() - allocate a dma mr using the applications intended > > MR roles and MR attributes. This uses rdma_device_access_flags(). > > > > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > > for a fast register WR given the applications intended MR roles and > > MR attributes. This uses rdma_device_access_flags(). > > > > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > > --- > > > > drivers/infiniband/core/verbs.c | 30 +++++++++++ > > include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 136 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > > index bac3fb4..f42595c 100644 > > --- a/drivers/infiniband/core/verbs.c > > +++ b/drivers/infiniband/core/verbs.c > > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > > } > > EXPORT_SYMBOL(ib_get_dma_mr); > > > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > > +{ > > + int access_flags = attrs; > > + > > + if (roles & RDMA_MRR_RECV) > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + > > + if (roles & RDMA_MRR_WRITE_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > > + > > + if (roles & RDMA_MRR_READ_DEST) { > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + if (rdma_protocol_iwarp(pd->device, > > + rdma_start_port(pd->device))) > > + access_flags |= IB_ACCESS_REMOTE_WRITE; > > + } > > + > > + if (roles & RDMA_MRR_READ_SOURCE) > > + access_flags |= IB_ACCESS_REMOTE_READ; > > + > > + if (roles & RDMA_MRR_ATOMIC_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > > + > > + if (roles & RDMA_MRR_MW_BIND) > > + access_flags |= IB_ACCESS_MW_BIND; > > + > > + return access_flags; > > +} > > +EXPORT_SYMBOL(rdma_device_access_flags); > > + > > struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, > > struct ib_phys_buf *phys_buf_array, > > int num_phys_buf, > > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > > index 986fddb..da1eadf 100644 > > --- a/include/rdma/ib_verbs.h > > +++ b/include/rdma/ib_verbs.h > > @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) > > struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); > > > > /** > > + * rdma_mr_roles - possible roles an RDMA MR will be used for > > + * > > + * This allows a transport independent RDMA application to > > + * create MRs that are usable for all the desired roles w/o > > + * having to understand which access rights are needed. > > + */ > > +enum { > > + > > + /* lkey used in a ib_recv_wr sge */ > > + RDMA_MRR_RECV = 1, > > + > > + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ > > + RDMA_MRR_SEND = (1<<1), > > + > > + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ > > + RDMA_MRR_READ_SOURCE = (1<<2), > > + > > + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ > > + RDMA_MRR_READ_DEST = (1<<3), > > + > > + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ > > + RDMA_MRR_WRITE_SOURCE = (1<<4), > > + > > + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ > > + RDMA_MRR_WRITE_DEST = (1<<5), > > + > > + /* > > + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge > > + */ > > + RDMA_MRR_ATOMIC_SOURCE = (1<<6), > > + > > + /* > > + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr > > + * wr.atomic.rkey > > + */ > > + RDMA_MRR_ATOMIC_DEST = (1<<7), > > + > > + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ > > + RDMA_MRR_MW_BIND = (1<<8), > > +}; > > + > > +/** > > + * rdma_mr_attributes - attributes for rdma memory regions > > + */ > > +enum { > > + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > > + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > > +}; > > + > > +/** > > + * rdma_device_access_flags - Returns the device-specific MR access flags. > > + * @pd: The protection domain associated with the memory region. > > + * @roles: The intended roles of the MR > > + * @attrs: The desired attributes of the MR > > + * > > + * Use the intended roles from @roles along with @attrs and the device > > + * capabilities to generate the needed access rights. > > + * > > + * Return: the ib_access_flags value needed to allocate the MR. > > + */ > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); > > + > > +/** > > + * rdma_get_dma_mr - Returns a memory region for system memory that is > > + * usable for DMA. > > + * @pd: The protection domain associated with the memory region. > > + * @roles: The intended roles of the MR > > + * @attrs: The desired attributes of the MR > > + * > > + * Use the intended roles from @roles along with @attrs and the device > > + * capabilities to define the needed access rights, and call > > + * ib_get_dma_mr() to allocate the MR. > > + * > > + * Note that the ib_dma_*() functions defined below must be used > > + * to create/destroy addresses used with the Lkey or Rkey returned > > + * by ib_get_dma_mr(). > > + * > > + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon > > + * failure. > > + */ > > +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, > > + int attrs) > > +{ > > + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); > > +} > > + > > I still think this wrapper should go away... Ok. I'll remove all uses of ib_get_dma_mr()... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il] > Sent: Monday, July 06, 2015 2:59 AM > To: Steve Wise; dledford@redhat.com > Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 7/6/2015 2:22 AM, Steve Wise wrote: > > The semantics for MR access flags are not consistent across RDMA > > protocols. So rather than have applications try and glean what they > > need, have them pass in the intended roles and attributes for the MR to > > be allocated and let the RDMA core select the appropriate access flags > > given the roles, attributes, and device capabilities. > > > > We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > > possible roles and attributes for a MR. These are documented in the > > enums themselves. > > > > New services exported: > > > > rdma_device_access_flags() - given the intended MR roles and attributes > > passed in, return the ib_access_flags bits for the device. > > > > rdma_get_dma_mr() - allocate a dma mr using the applications intended > > MR roles and MR attributes. This uses rdma_device_access_flags(). > > > > rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > > for a fast register WR given the applications intended MR roles and > > MR attributes. This uses rdma_device_access_flags(). > > > > Signed-off-by: Steve Wise <swise@opengridcomputing.com> > > --- > > > > drivers/infiniband/core/verbs.c | 30 +++++++++++ > > include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 136 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > > index bac3fb4..f42595c 100644 > > --- a/drivers/infiniband/core/verbs.c > > +++ b/drivers/infiniband/core/verbs.c > > @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > > } > > EXPORT_SYMBOL(ib_get_dma_mr); > > > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > > +{ > > + int access_flags = attrs; > > + > > + if (roles & RDMA_MRR_RECV) > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + > > + if (roles & RDMA_MRR_WRITE_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > > + > > + if (roles & RDMA_MRR_READ_DEST) { > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + if (rdma_protocol_iwarp(pd->device, > > + rdma_start_port(pd->device))) > > + access_flags |= IB_ACCESS_REMOTE_WRITE; > > + } > > + > > + if (roles & RDMA_MRR_READ_SOURCE) > > + access_flags |= IB_ACCESS_REMOTE_READ; > > + > > + if (roles & RDMA_MRR_ATOMIC_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > > + > > + if (roles & RDMA_MRR_MW_BIND) > > + access_flags |= IB_ACCESS_MW_BIND; > > + > > + return access_flags; > > +} > > +EXPORT_SYMBOL(rdma_device_access_flags); > > + > > struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, > > struct ib_phys_buf *phys_buf_array, > > int num_phys_buf, > > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > > index 986fddb..da1eadf 100644 > > --- a/include/rdma/ib_verbs.h > > +++ b/include/rdma/ib_verbs.h > > @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) > > struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); > > > > /** > > + * rdma_mr_roles - possible roles an RDMA MR will be used for > > + * > > + * This allows a transport independent RDMA application to > > + * create MRs that are usable for all the desired roles w/o > > + * having to understand which access rights are needed. > > + */ > > +enum { > > + > > + /* lkey used in a ib_recv_wr sge */ > > + RDMA_MRR_RECV = 1, > > + > > + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ > > + RDMA_MRR_SEND = (1<<1), > > + > > + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ > > + RDMA_MRR_READ_SOURCE = (1<<2), > > + > > + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ > > + RDMA_MRR_READ_DEST = (1<<3), > > + > > + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ > > + RDMA_MRR_WRITE_SOURCE = (1<<4), > > + > > + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ > > + RDMA_MRR_WRITE_DEST = (1<<5), > > + > > + /* > > + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge > > + */ > > + RDMA_MRR_ATOMIC_SOURCE = (1<<6), > > + > > + /* > > + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr > > + * wr.atomic.rkey > > + */ > > + RDMA_MRR_ATOMIC_DEST = (1<<7), > > + > > + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ > > + RDMA_MRR_MW_BIND = (1<<8), > > +}; > > + > > +/** > > + * rdma_mr_attributes - attributes for rdma memory regions > > + */ > > +enum { > > + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > > + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > > +}; > > + > > +/** > > + * rdma_device_access_flags - Returns the device-specific MR access flags. > > + * @pd: The protection domain associated with the memory region. > > + * @roles: The intended roles of the MR > > + * @attrs: The desired attributes of the MR > > + * > > + * Use the intended roles from @roles along with @attrs and the device > > + * capabilities to generate the needed access rights. > > + * > > + * Return: the ib_access_flags value needed to allocate the MR. > > + */ > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); > > + > > +/** > > + * rdma_get_dma_mr - Returns a memory region for system memory that is > > + * usable for DMA. > > + * @pd: The protection domain associated with the memory region. > > + * @roles: The intended roles of the MR > > + * @attrs: The desired attributes of the MR > > + * > > + * Use the intended roles from @roles along with @attrs and the device > > + * capabilities to define the needed access rights, and call > > + * ib_get_dma_mr() to allocate the MR. > > + * > > + * Note that the ib_dma_*() functions defined below must be used > > + * to create/destroy addresses used with the Lkey or Rkey returned > > + * by ib_get_dma_mr(). > > + * > > + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon > > + * failure. > > + */ > > +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, > > + int attrs) > > +{ > > + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); > > +} > > + > > +/** > > + * rdma_fast_reg_access_flags - Returns the access flags needed for a fast > > + * register operation. > > + * @pd: The protection domain associated with the memory region. > > + * @roles: The intended roles of the MR > > + * @attrs: The desired attributes of the MR > > + * > > + * Use the intended roles from @roles along with @attrs and the device > > + * capabilities to define the needed access rights for a fast registration > > + * work request. > > + * > > + * Return: the ib_access_flags value needed for fast registration. > > + */ > > +static inline int rdma_fast_reg_access_flags(struct ib_pd *pd, int roles, > > + int attrs) > > +{ > > + return rdma_device_access_flags(pd, roles, attrs); > > +} > > Why do we have rdma_fast_reg_access_flags() that just trampolines to > rdma_device_access_flags()? why do we need it? My initial thought was that fast_reg access flags might be different that general MR access flags for some devices. But since that isn't the cases, I'll remove it. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/6/2015 5:37 PM, Steve Wise wrote: > > >> -----Original Message----- >> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il] >> Sent: Monday, July 06, 2015 2:54 AM >> To: Steve Wise; dledford@redhat.com >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags >> >> On 7/6/2015 2:22 AM, Steve Wise wrote: >>> The semantics for MR access flags are not consistent across RDMA >>> protocols. So rather than have applications try and glean what they >>> need, have them pass in the intended roles and attributes for the MR to >>> be allocated and let the RDMA core select the appropriate access flags >>> given the roles, attributes, and device capabilities. >>> >>> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the >>> possible roles and attributes for a MR. These are documented in the >>> enums themselves. >>> >>> New services exported: >>> >>> rdma_device_access_flags() - given the intended MR roles and attributes >>> passed in, return the ib_access_flags bits for the device. >>> >>> rdma_get_dma_mr() - allocate a dma mr using the applications intended >>> MR roles and MR attributes. This uses rdma_device_access_flags(). >>> >>> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed >>> for a fast register WR given the applications intended MR roles and >>> MR attributes. This uses rdma_device_access_flags(). >>> >>> Signed-off-by: Steve Wise <swise@opengridcomputing.com> >>> --- >>> >>> drivers/infiniband/core/verbs.c | 30 +++++++++++ >>> include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ >>> 2 files changed, 136 insertions(+), 0 deletions(-) >>> >>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c >>> index bac3fb4..f42595c 100644 >>> --- a/drivers/infiniband/core/verbs.c >>> +++ b/drivers/infiniband/core/verbs.c >>> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) >>> } >>> EXPORT_SYMBOL(ib_get_dma_mr); >>> >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) >>> +{ >>> + int access_flags = attrs; >>> + >>> + if (roles & RDMA_MRR_RECV) >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; >>> + >>> + if (roles & RDMA_MRR_WRITE_DEST) >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; >>> + >>> + if (roles & RDMA_MRR_READ_DEST) { >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; >>> + if (rdma_protocol_iwarp(pd->device, >>> + rdma_start_port(pd->device))) >>> + access_flags |= IB_ACCESS_REMOTE_WRITE; >>> + } >>> + >>> + if (roles & RDMA_MRR_READ_SOURCE) >>> + access_flags |= IB_ACCESS_REMOTE_READ; >>> + >>> + if (roles & RDMA_MRR_ATOMIC_DEST) >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; >>> + >>> + if (roles & RDMA_MRR_MW_BIND) >>> + access_flags |= IB_ACCESS_MW_BIND; >>> + >>> + return access_flags; >>> +} >>> +EXPORT_SYMBOL(rdma_device_access_flags); >>> + >>> struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, >>> struct ib_phys_buf *phys_buf_array, >>> int num_phys_buf, >>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h >>> index 986fddb..da1eadf 100644 >>> --- a/include/rdma/ib_verbs.h >>> +++ b/include/rdma/ib_verbs.h >>> @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) >>> struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); >>> >>> /** >>> + * rdma_mr_roles - possible roles an RDMA MR will be used for >>> + * >>> + * This allows a transport independent RDMA application to >>> + * create MRs that are usable for all the desired roles w/o >>> + * having to understand which access rights are needed. >>> + */ >>> +enum { >>> + >>> + /* lkey used in a ib_recv_wr sge */ >>> + RDMA_MRR_RECV = 1, >>> + >>> + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ >>> + RDMA_MRR_SEND = (1<<1), >>> + >>> + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ >>> + RDMA_MRR_READ_SOURCE = (1<<2), >>> + >>> + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ >>> + RDMA_MRR_READ_DEST = (1<<3), >>> + >>> + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ >>> + RDMA_MRR_WRITE_SOURCE = (1<<4), >>> + >>> + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ >>> + RDMA_MRR_WRITE_DEST = (1<<5), >>> + >>> + /* >>> + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge >>> + */ >>> + RDMA_MRR_ATOMIC_SOURCE = (1<<6), >>> + >>> + /* >>> + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr >>> + * wr.atomic.rkey >>> + */ >>> + RDMA_MRR_ATOMIC_DEST = (1<<7), >>> + >>> + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ >>> + RDMA_MRR_MW_BIND = (1<<8), >>> +}; >>> + >>> +/** >>> + * rdma_mr_attributes - attributes for rdma memory regions >>> + */ >>> +enum { >>> + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, >>> + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, >>> +}; >>> + >>> +/** >>> + * rdma_device_access_flags - Returns the device-specific MR access flags. >>> + * @pd: The protection domain associated with the memory region. >>> + * @roles: The intended roles of the MR >>> + * @attrs: The desired attributes of the MR >>> + * >>> + * Use the intended roles from @roles along with @attrs and the device >>> + * capabilities to generate the needed access rights. >>> + * >>> + * Return: the ib_access_flags value needed to allocate the MR. >>> + */ >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); >>> + >>> +/** >>> + * rdma_get_dma_mr - Returns a memory region for system memory that is >>> + * usable for DMA. >>> + * @pd: The protection domain associated with the memory region. >>> + * @roles: The intended roles of the MR >>> + * @attrs: The desired attributes of the MR >>> + * >>> + * Use the intended roles from @roles along with @attrs and the device >>> + * capabilities to define the needed access rights, and call >>> + * ib_get_dma_mr() to allocate the MR. >>> + * >>> + * Note that the ib_dma_*() functions defined below must be used >>> + * to create/destroy addresses used with the Lkey or Rkey returned >>> + * by ib_get_dma_mr(). >>> + * >>> + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon >>> + * failure. >>> + */ >>> +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, >>> + int attrs) >>> +{ >>> + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); >>> +} >>> + >> >> I still think this wrapper should go away... > > Ok. I'll remove all uses of ib_get_dma_mr()... > > I meant that rdma_get_dma_mr can go away. I'd prefer to get the needed access_flags and just call existing verb. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Sagi Grimberg > Sent: Monday, July 06, 2015 11:18 AM > To: Steve Wise; dledford@redhat.com > Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 7/6/2015 5:37 PM, Steve Wise wrote: > > > > > >> -----Original Message----- > >> From: Sagi Grimberg [mailto:sagig@dev.mellanox.co.il] > >> Sent: Monday, July 06, 2015 2:54 AM > >> To: Steve Wise; dledford@redhat.com > >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > >> > >> On 7/6/2015 2:22 AM, Steve Wise wrote: > >>> The semantics for MR access flags are not consistent across RDMA > >>> protocols. So rather than have applications try and glean what they > >>> need, have them pass in the intended roles and attributes for the MR to > >>> be allocated and let the RDMA core select the appropriate access flags > >>> given the roles, attributes, and device capabilities. > >>> > >>> We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the > >>> possible roles and attributes for a MR. These are documented in the > >>> enums themselves. > >>> > >>> New services exported: > >>> > >>> rdma_device_access_flags() - given the intended MR roles and attributes > >>> passed in, return the ib_access_flags bits for the device. > >>> > >>> rdma_get_dma_mr() - allocate a dma mr using the applications intended > >>> MR roles and MR attributes. This uses rdma_device_access_flags(). > >>> > >>> rdma_fast_reg_access_flags() - return the ib_access_flags bits needed > >>> for a fast register WR given the applications intended MR roles and > >>> MR attributes. This uses rdma_device_access_flags(). > >>> > >>> Signed-off-by: Steve Wise <swise@opengridcomputing.com> > >>> --- > >>> > >>> drivers/infiniband/core/verbs.c | 30 +++++++++++ > >>> include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ > >>> 2 files changed, 136 insertions(+), 0 deletions(-) > >>> > >>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > >>> index bac3fb4..f42595c 100644 > >>> --- a/drivers/infiniband/core/verbs.c > >>> +++ b/drivers/infiniband/core/verbs.c > >>> @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) > >>> } > >>> EXPORT_SYMBOL(ib_get_dma_mr); > >>> > >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > >>> +{ > >>> + int access_flags = attrs; > >>> + > >>> + if (roles & RDMA_MRR_RECV) > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; > >>> + > >>> + if (roles & RDMA_MRR_WRITE_DEST) > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > >>> + > >>> + if (roles & RDMA_MRR_READ_DEST) { > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; > >>> + if (rdma_protocol_iwarp(pd->device, > >>> + rdma_start_port(pd->device))) > >>> + access_flags |= IB_ACCESS_REMOTE_WRITE; > >>> + } > >>> + > >>> + if (roles & RDMA_MRR_READ_SOURCE) > >>> + access_flags |= IB_ACCESS_REMOTE_READ; > >>> + > >>> + if (roles & RDMA_MRR_ATOMIC_DEST) > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > >>> + > >>> + if (roles & RDMA_MRR_MW_BIND) > >>> + access_flags |= IB_ACCESS_MW_BIND; > >>> + > >>> + return access_flags; > >>> +} > >>> +EXPORT_SYMBOL(rdma_device_access_flags); > >>> + > >>> struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, > >>> struct ib_phys_buf *phys_buf_array, > >>> int num_phys_buf, > >>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > >>> index 986fddb..da1eadf 100644 > >>> --- a/include/rdma/ib_verbs.h > >>> +++ b/include/rdma/ib_verbs.h > >>> @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) > >>> struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); > >>> > >>> /** > >>> + * rdma_mr_roles - possible roles an RDMA MR will be used for > >>> + * > >>> + * This allows a transport independent RDMA application to > >>> + * create MRs that are usable for all the desired roles w/o > >>> + * having to understand which access rights are needed. > >>> + */ > >>> +enum { > >>> + > >>> + /* lkey used in a ib_recv_wr sge */ > >>> + RDMA_MRR_RECV = 1, > >>> + > >>> + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ > >>> + RDMA_MRR_SEND = (1<<1), > >>> + > >>> + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ > >>> + RDMA_MRR_READ_SOURCE = (1<<2), > >>> + > >>> + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ > >>> + RDMA_MRR_READ_DEST = (1<<3), > >>> + > >>> + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ > >>> + RDMA_MRR_WRITE_SOURCE = (1<<4), > >>> + > >>> + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ > >>> + RDMA_MRR_WRITE_DEST = (1<<5), > >>> + > >>> + /* > >>> + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge > >>> + */ > >>> + RDMA_MRR_ATOMIC_SOURCE = (1<<6), > >>> + > >>> + /* > >>> + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr > >>> + * wr.atomic.rkey > >>> + */ > >>> + RDMA_MRR_ATOMIC_DEST = (1<<7), > >>> + > >>> + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ > >>> + RDMA_MRR_MW_BIND = (1<<8), > >>> +}; > >>> + > >>> +/** > >>> + * rdma_mr_attributes - attributes for rdma memory regions > >>> + */ > >>> +enum { > >>> + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, > >>> + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, > >>> +}; > >>> + > >>> +/** > >>> + * rdma_device_access_flags - Returns the device-specific MR access flags. > >>> + * @pd: The protection domain associated with the memory region. > >>> + * @roles: The intended roles of the MR > >>> + * @attrs: The desired attributes of the MR > >>> + * > >>> + * Use the intended roles from @roles along with @attrs and the device > >>> + * capabilities to generate the needed access rights. > >>> + * > >>> + * Return: the ib_access_flags value needed to allocate the MR. > >>> + */ > >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); > >>> + > >>> +/** > >>> + * rdma_get_dma_mr - Returns a memory region for system memory that is > >>> + * usable for DMA. > >>> + * @pd: The protection domain associated with the memory region. > >>> + * @roles: The intended roles of the MR > >>> + * @attrs: The desired attributes of the MR > >>> + * > >>> + * Use the intended roles from @roles along with @attrs and the device > >>> + * capabilities to define the needed access rights, and call > >>> + * ib_get_dma_mr() to allocate the MR. > >>> + * > >>> + * Note that the ib_dma_*() functions defined below must be used > >>> + * to create/destroy addresses used with the Lkey or Rkey returned > >>> + * by ib_get_dma_mr(). > >>> + * > >>> + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon > >>> + * failure. > >>> + */ > >>> +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, > >>> + int attrs) > >>> +{ > >>> + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); > >>> +} > >>> + > >> > >> I still think this wrapper should go away... > > > > Ok. I'll remove all uses of ib_get_dma_mr()... > > > > > > I meant that rdma_get_dma_mr can go away. I'd prefer to get the > needed access_flags and just call existing verb. > So just export rdma_device_access_flags(rd, roles, attrs), and then change the ULPs to call this to obtain the access flags. That is ok with me. What do others think? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 06, 2015 at 09:23:42AM -0500, Steve Wise wrote: > > Please add an assert for the values that are allowed for attrs. > > > > It also would be highly useful to add a kerneldoc comment describing > > the function and the parameters. Also __bitwise sparse tricks > > to ensure the right flags are passed wouldn't be a too bad idea. > > > > Can you explain the "__bitwise sparse tricks"? Or point me to an example. Grep the kernel tree for __bitwise. It allows creating a type with additional type annotations so that sparse will warn when assigning incorrect values to it. > > Why not specfiy these as separate nuerical namespace? > > > > To avoid having to translate them. But they are different values, so you should translate them. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 06, 2015 at 07:17:38PM +0300, Sagi Grimberg wrote: > >Ok. I'll remove all uses of ib_get_dma_mr()... > > > > > > I meant that rdma_get_dma_mr can go away. I'd prefer to get the > needed access_flags and just call existing verb. I strongly disagree. As this series has shown the existing API is not epressive enough for all transports. It thus needs to go away and be fully replaced by the new API introduced here. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/7/2015 12:00 PM, Christoph Hellwig wrote: > On Mon, Jul 06, 2015 at 07:17:38PM +0300, Sagi Grimberg wrote: >>> Ok. I'll remove all uses of ib_get_dma_mr()... >>> >>> >> >> I meant that rdma_get_dma_mr can go away. I'd prefer to get the >> needed access_flags and just call existing verb. > > I strongly disagree. As this series has shown the existing API is not > epressive enough for all transports. It thus needs to go away and be > fully replaced by the new API introduced here. > Christoph, I wasn't arguing about having a transport independent API. I was referring to this wrapper specifically that trampolines to ib_get_dma_mr() with rdma_device_access_flags(pd, roles, attrs) helper. The rdma_device_access_flags() itself is fine. However, given that this helper is used elsewhere as well, I don't see the point of having yet another helper specifically just for the dma_mr case that does nothing more than trampolines with a call to rdma_device_access_flags(). Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Sagi Grimberg > Sent: Tuesday, July 07, 2015 4:15 AM > To: Christoph Hellwig > Cc: Steve Wise; dledford@redhat.com; sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; > eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 7/7/2015 12:00 PM, Christoph Hellwig wrote: > > On Mon, Jul 06, 2015 at 07:17:38PM +0300, Sagi Grimberg wrote: > >>> Ok. I'll remove all uses of ib_get_dma_mr()... > >>> > >>> > >> > >> I meant that rdma_get_dma_mr can go away. I'd prefer to get the > >> needed access_flags and just call existing verb. > > > > I strongly disagree. As this series has shown the existing API is not > > epressive enough for all transports. It thus needs to go away and be > > fully replaced by the new API introduced here. > > > > Christoph, > > I wasn't arguing about having a transport independent API. I was > referring to this wrapper specifically that trampolines to > ib_get_dma_mr() with rdma_device_access_flags(pd, roles, attrs) helper. > > The rdma_device_access_flags() itself is fine. However, given that > this helper is used elsewhere as well, I don't see the point of having > yet another helper specifically just for the dma_mr case that does > nothing more than trampolines with a call to rdma_device_access_flags(). > I took the feedback from Christoph and Jason to mean I should remove ib_get_dma_mr() entirely and pull its guts into rdma_get_dma_mr(), and change all the users of ib_get_dma_mr() to use rdma_get_dma_mr(). So the net result isn't a wrapper. It would of course still use rdma_device_access_flags()... Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran > Sent: Monday, July 06, 2015 2:09 AM > To: Steve Wise; dledford@redhat.com > Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 06/07/2015 02:22, Steve Wise wrote: > > +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > > +{ > > + int access_flags = attrs; > > + > > + if (roles & RDMA_MRR_RECV) > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + > > + if (roles & RDMA_MRR_WRITE_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > > + > > + if (roles & RDMA_MRR_READ_DEST) { > > + access_flags |= IB_ACCESS_LOCAL_WRITE; > > + if (rdma_protocol_iwarp(pd->device, > > + rdma_start_port(pd->device))) > > + access_flags |= IB_ACCESS_REMOTE_WRITE; > > + } > > + > > + if (roles & RDMA_MRR_READ_SOURCE) > > + access_flags |= IB_ACCESS_REMOTE_READ; > > + > > + if (roles & RDMA_MRR_ATOMIC_DEST) > > + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > > I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the > results of the atomic operation. > Where/how are the results returned? In a recv completion? If so, then that MR would need RDMA_MRR_RECV, not RDMA_MRR_ATOMIC_SOURCE. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/07/2015 17:17, Steve Wise wrote: > > >> -----Original Message----- >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran >> Sent: Monday, July 06, 2015 2:09 AM >> To: Steve Wise; dledford@redhat.com >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags >> >> On 06/07/2015 02:22, Steve Wise wrote: >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) >>> +{ >>> + int access_flags = attrs; >>> + >>> + if (roles & RDMA_MRR_RECV) >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; >>> + >>> + if (roles & RDMA_MRR_WRITE_DEST) >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; >>> + >>> + if (roles & RDMA_MRR_READ_DEST) { >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; >>> + if (rdma_protocol_iwarp(pd->device, >>> + rdma_start_port(pd->device))) >>> + access_flags |= IB_ACCESS_REMOTE_WRITE; >>> + } >>> + >>> + if (roles & RDMA_MRR_READ_SOURCE) >>> + access_flags |= IB_ACCESS_REMOTE_READ; >>> + >>> + if (roles & RDMA_MRR_ATOMIC_DEST) >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; >> >> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the >> results of the atomic operation. >> > > Where/how are the results returned? In a recv completion? If so, then that MR would need RDMA_MRR_RECV, not RDMA_MRR_ATOMIC_SOURCE. They are returned in the scatter list provided in ib_send_wr.sg_list, similarly to how RDMA read results are returned. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Haggai Eran [mailto:haggaie@mellanox.com] > Sent: Tuesday, July 07, 2015 9:34 AM > To: Steve Wise; dledford@redhat.com > Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On 07/07/2015 17:17, Steve Wise wrote: > > > > > >> -----Original Message----- > >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran > >> Sent: Monday, July 06, 2015 2:09 AM > >> To: Steve Wise; dledford@redhat.com > >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- > >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org > >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > >> > >> On 06/07/2015 02:22, Steve Wise wrote: > >>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) > >>> +{ > >>> + int access_flags = attrs; > >>> + > >>> + if (roles & RDMA_MRR_RECV) > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; > >>> + > >>> + if (roles & RDMA_MRR_WRITE_DEST) > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; > >>> + > >>> + if (roles & RDMA_MRR_READ_DEST) { > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE; > >>> + if (rdma_protocol_iwarp(pd->device, > >>> + rdma_start_port(pd->device))) > >>> + access_flags |= IB_ACCESS_REMOTE_WRITE; > >>> + } > >>> + > >>> + if (roles & RDMA_MRR_READ_SOURCE) > >>> + access_flags |= IB_ACCESS_REMOTE_READ; > >>> + > >>> + if (roles & RDMA_MRR_ATOMIC_DEST) > >>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; > >> > >> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the > >> results of the atomic operation. > >> > > > > Where/how are the results returned? In a recv completion? If so, then that MR would need RDMA_MRR_RECV, not > RDMA_MRR_ATOMIC_SOURCE. > > They are returned in the scatter list provided in ib_send_wr.sg_list, > similarly to how RDMA read results are returned. Ah. Hmm. I was confused about how the atomic operations worked. Is this correct: ib_send_wr.wr.atomic.remote_addr : the peer's address that will be the target of the atomic operation. ib_send_wr.wr.atomic.compare_add/compare_add_mask: the data to be used in the atomic compare-and-add on the target address ib_send_wr.wr.atomic.swap/swap_mask: the data to be used in an atomic swap on the target address. ib_send_wr.sg_list: results from the swap or compare/add. Is the above correct? Maybe the two role names should be RDMA_MRR_ATOMIC_TARGET and RDMA_MRR_ATOMIC_RESULT? Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/07/2015 17:46, Steve Wise wrote: > > >> -----Original Message----- >> From: Haggai Eran [mailto:haggaie@mellanox.com] >> Sent: Tuesday, July 07, 2015 9:34 AM >> To: Steve Wise; dledford@redhat.com >> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- >> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org >> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags >> >> On 07/07/2015 17:17, Steve Wise wrote: >>> >>> >>>> -----Original Message----- >>>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of Haggai Eran >>>> Sent: Monday, July 06, 2015 2:09 AM >>>> To: Steve Wise; dledford@redhat.com >>>> Cc: sagig@mellanox.com; ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target- >>>> devel@vger.kernel.org; linux-nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org >>>> Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags >>>> >>>> On 06/07/2015 02:22, Steve Wise wrote: >>>>> +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) >>>>> +{ >>>>> + int access_flags = attrs; >>>>> + >>>>> + if (roles & RDMA_MRR_RECV) >>>>> + access_flags |= IB_ACCESS_LOCAL_WRITE; >>>>> + >>>>> + if (roles & RDMA_MRR_WRITE_DEST) >>>>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; >>>>> + >>>>> + if (roles & RDMA_MRR_READ_DEST) { >>>>> + access_flags |= IB_ACCESS_LOCAL_WRITE; >>>>> + if (rdma_protocol_iwarp(pd->device, >>>>> + rdma_start_port(pd->device))) >>>>> + access_flags |= IB_ACCESS_REMOTE_WRITE; >>>>> + } >>>>> + >>>>> + if (roles & RDMA_MRR_READ_SOURCE) >>>>> + access_flags |= IB_ACCESS_REMOTE_READ; >>>>> + >>>>> + if (roles & RDMA_MRR_ATOMIC_DEST) >>>>> + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; >>>> >>>> I think you need LOCAL_WRITE for ATOMIC_SOURCE in order to receive the >>>> results of the atomic operation. >>>> >>> >>> Where/how are the results returned? In a recv completion? If so, then that MR would need RDMA_MRR_RECV, not >> RDMA_MRR_ATOMIC_SOURCE. >> >> They are returned in the scatter list provided in ib_send_wr.sg_list, >> similarly to how RDMA read results are returned. > > Ah. Hmm. I was confused about how the atomic operations worked. > > Is this correct: > > ib_send_wr.wr.atomic.remote_addr : the peer's address that will be the target of the atomic operation. > ib_send_wr.wr.atomic.compare_add/compare_add_mask: the data to be used in the atomic compare-and-add on the target address Yes, in compare and swap, this is the value to compare the original remote data with. In fetch and add, this would be the value to add. > ib_send_wr.wr.atomic.swap/swap_mask: the data to be used in an atomic swap on the target address. This is used in compare and swap as the data to write if the comparison succeeds. > ib_send_wr.sg_list: results from the swap or compare/add. > > Is the above correct? Pretty much. See above. > > Maybe the two role names should be RDMA_MRR_ATOMIC_TARGET and RDMA_MRR_ATOMIC_RESULT? Sounds good. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 07, 2015 at 09:05:15AM -0500, Steve Wise wrote: > I took the feedback from Christoph and Jason to mean I should remove > ib_get_dma_mr() entirely and pull its guts into rdma_get_dma_mr(), > and change all the users of ib_get_dma_mr() to use > rdma_get_dma_mr(). So the net result isn't a wrapper. It would of > course still use rdma_device_access_flags()... Right, to the greatest extent possible. Keeping ib_get_dma_mr and the old flags around just means someone could go back and use the old flags. I expect well need to keep the driver entry point for user space, but we should be able to hide the API and flags from other modules. A wrapper is a reasonable way to stage through that transition.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/7/2015 7:17 PM, Jason Gunthorpe wrote: > On Tue, Jul 07, 2015 at 09:05:15AM -0500, Steve Wise wrote: > >> I took the feedback from Christoph and Jason to mean I should remove >> ib_get_dma_mr() entirely and pull its guts into rdma_get_dma_mr(), >> and change all the users of ib_get_dma_mr() to use >> rdma_get_dma_mr(). So the net result isn't a wrapper. It would of >> course still use rdma_device_access_flags()... > > Right, to the greatest extent possible. > > Keeping ib_get_dma_mr and the old flags around just means someone > could go back and use the old flags. It looks strange to me that there is a helper to get a transport independent access_flags rdma_device_access_flags() that helps the user to choose the correct access_flags for MR creation and fast registration (which is more than OK), and having a wrapper that just hides it from the user... Doesn't it look odd to you? > > I expect well need to keep the driver entry point for user space, but > we should be able to hide the API and flags from other modules. dma_mr for user-space?? How can this be relevant for user-space? > > A wrapper is a reasonable way to stage through that transition.. I can't say that I completely agree here. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 07, 2015 at 07:27:47PM +0300, Sagi Grimberg wrote: > Doesn't it look odd to you? Sure, but the oddness is that rdma_device_access_flags exists at all, not the wrapper. The wrapper is what we want the API to look like, if we could trivially change the WR format as well then rdma_device_access_flags wouldn't even exist at all. > >I expect well need to keep the driver entry point for user space, but > >we should be able to hide the API and flags from other modules. > > dma_mr for user-space?? How can this be relevant for user-space? I didn't mean dma_mr specifically, but all the other driver entry points that use the old-style IBV_ACCESS_* flags. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/8/2015 12:36 AM, Jason Gunthorpe wrote: > On Tue, Jul 07, 2015 at 07:27:47PM +0300, Sagi Grimberg wrote: > >> Doesn't it look odd to you? > > Sure, but the oddness is that rdma_device_access_flags exists at all, > not the wrapper. The wrapper is what we want the API to look like, I don't necessarily agree. The API we'd want is a single API at all the call sites to all types of MRs. We have different QP types, and still we don't have an allocation API for each and every one. I honestly don't see why we have that for MRs. If we can converge to a single API for MR allocation we can just get it right once. > if we could trivially change the WR format as well then > rdma_device_access_flags wouldn't even exist at all. I have taken some time to truly think about that following Christoph's comments to my indirect registration patches. This is one of the things I'm looking at. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 07, 2015 at 09:05:15AM -0500, Steve Wise wrote: > I took the feedback from Christoph and Jason to mean I should remove ib_get_dma_mr() entirely and pull its guts into > rdma_get_dma_mr(), and change all the users of ib_get_dma_mr() to use rdma_get_dma_mr(). So the net result isn't a wrapper. It > would of course still use rdma_device_access_flags()... That's what I'd prefer, yes. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 10:29:56AM +0300, Sagi Grimberg wrote: > I don't necessarily agree. The API we'd want is a single API at all > the call sites to all types of MRs. We have different QP types, and > still we don't have an allocation API for each and every one. > I honestly don't see why we have that for MRs. > > If we can converge to a single API for MR allocation we can just get it > right once. That sounds even better in the long run, but sorting out one problem at a time seems like the easier way forward. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/8/2015 11:13 AM, 'Christoph Hellwig' wrote: > On Wed, Jul 08, 2015 at 10:29:56AM +0300, Sagi Grimberg wrote: >> I don't necessarily agree. The API we'd want is a single API at all >> the call sites to all types of MRs. We have different QP types, and >> still we don't have an allocation API for each and every one. >> I honestly don't see why we have that for MRs. >> >> If we can converge to a single API for MR allocation we can just get it >> right once. > > That sounds even better in the long run, but sorting out one problem at > a time seems like the easier way forward. > If we agree to consolidate on a single MR allocation API, I don't see how this wrapper is moving us forward. But if you guys prefer to have it than I don't have a hard objection. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 01:05:28PM +0300, Sagi Grimberg wrote: > If we agree to consolidate on a single MR allocation API, I don't see > how this wrapper is moving us forward. But if you guys prefer to have it > than I don't have a hard objection. Well, when are we going to get that MR allocation API? Unless it's ready to go in ASAP I'd rather take a first step in the right direction instead of waiting. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/8/2015 1:20 PM, 'Christoph Hellwig' wrote: > On Wed, Jul 08, 2015 at 01:05:28PM +0300, Sagi Grimberg wrote: >> If we agree to consolidate on a single MR allocation API, I don't see >> how this wrapper is moving us forward. But if you guys prefer to have it >> than I don't have a hard objection. > > Well, when are we going to get that MR allocation API? Unless it's > ready to go in ASAP I'd rather take a first step in the right direction > instead of waiting. > I am still not clear if all of us agree that we need it. Sean and Steve had some disclaimers... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I am still not clear if all of us agree that we need it. > Sean and Steve had some disclaimers... A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used. We have a single entry point for post_send, for example, and that makes things worse. IMO, we need fewer registration *methods* not fewer calls. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 10:29:56AM +0300, Sagi Grimberg wrote: > >Sure, but the oddness is that rdma_device_access_flags exists at all, > >not the wrapper. The wrapper is what we want the API to look like, > > I don't necessarily agree. The API we'd want is a single API at all > the call sites to all types of MRs. Well, I'm speaking specifically about the access protection flags because that is all this patch set is working on. What to do about the rest of the mess is a whole other problem, and completely orthogonal to the access protection problem. > We have different QP types, and still we don't have an allocation > API for each and every one. I honestly don't see why we have that > for MRs. The MR stuff was never really designed, the HW people provided some capability and the SW side just raw exposed it, thoughtlessly. > If we can converge to a single API for MR allocation we can just get it > right once. I'm not sure focusing on MR is the right layer, but who knows. Every time I look at this, I just shake my head and go *WTF*? Even something simple like issuing a SEND against local memory seems horribly complicated. Why do we have local_dma_lkey and ib_get_dma_mr? Why is code using iWarp calling ib_get_dma_mr with RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure. Why on earth is NFS using frmr to create RDMA READ lkeys? The flags change is easy, and makes sense, but this whole thing looks deeply rotten. I'd strongly suggest a MR cleanup should look first at purging lkey completely from the in-kernel API. I think when you do that, it quickly becomes clear that iWarp's problem is not a seemingly minor issue with different flag bits, but that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer. Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work around. So iWarp (and only iWarp) should take the pain of spinning up temporary local MRs for RDMA READ. This should all be hidden under a common API and it shouldn't be sprinkled all over the place in NFS, iSER, etc. So, I'd say, first order of buisness to fix the MR mess would be to purge lkey from the in-kernel API and rationalize the local side to something cleaner. Then, what is left is all remote MRs and maybe it will be clearer what to do about them then... > >if we could trivially change the WR format as well then > >rdma_device_access_flags wouldn't even exist at all. > > I have taken some time to truly think about that following Christoph's > comments to my indirect registration patches. This is one of the things > I'm looking at. I really hope you can come up with something here, I'm sure you can get some help on this issue, because all the storage ULPs are suffering together under this awful set of APIs. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote: > Then, what is left is all remote MRs and maybe it will be clearer what > to do about them then... From looking at that for a while the APIs needed seem pretty simple to me from a consumer perspective: struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages); void rdma_free_mr(struct rdma_mr *mr); /* updates *sg if the SG couldn't be fully registered due to offsets */ int rdma_register_sg(struct rdma_mr *mr, struct scatterlist **sg, u32 *pkey, u32 *offset, u32 *len); void rdma_unregister_sg(struct rdma_mr *mr, struct scatterlist *sg); plus maybe a pool alloc API if we care about FMR performance. Note that this assumes that the iSER bounce buffer hacks are replaced with the QUEUE_FLAG_SG_GAPS flags and a change to the SG_IO code to bounce buffer for vectored SG_IO calls. I'm happy to provide a patch for that if I can find a volunteer for testing. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote: > /* updates *sg if the SG couldn't be fully registered due to offsets */ > int rdma_register_sg(struct rdma_mr *mr, struct scatterlist **sg, > u32 *pkey, u32 *offset, u32 *len); plus an "enum dma_data_direction direction" of course. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/8/2015 3:08 PM, Jason Gunthorpe wrote: > The MR stuff was never really designed, the HW people provided some > capability and the SW side just raw exposed it, thoughtlessly. Jason, I don't disagree that the API can be improved. I have some responses to your statements below though. > Why is code using iWarp calling ib_get_dma_mr with > RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure. Because the iWARP protocol requires it, which was very much an intentional decision. It actually is not insecure, as discussed in detail in RFC5042. However, it is different from Infiniband. > Why on earth is NFS using frmr to create RDMA READ lkeys? Because NFS desires to have a single code path that works for all transports. In addition, using the IB dma_mr as an lkey means that large I/O (common with NFS) would require multiple RDMA Read operations, when the page list exceeded the local scatter/gather limit of the device. > I think when you do that, it quickly becomes clear that iWarp's > problem is not a seemingly minor issue with different flag bits, but > that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer. > Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work > around. So iWarp (and only iWarp) should take the pain of spinning up > temporary local MRs for RDMA READ. That is entirely the wrong layer to address this. It would prevent reuse of MRs, and require upper layers to be aware that this was the case - which is exactly the opposite of what you are trying to achieve. > This should all be hidden under a common API and it shouldn't be > sprinkled all over the place in NFS, iSER, etc. Are you arguing that all upper layer storage protocols take a single common approach to memory registration? That is a very different discussion. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 05:38:05PM -0400, Tom Talpey wrote: > On 7/8/2015 3:08 PM, Jason Gunthorpe wrote: > >The MR stuff was never really designed, the HW people provided some > >capability and the SW side just raw exposed it, thoughtlessly. > > Jason, I don't disagree that the API can be improved. I have some > responses to your statements below though. > > >Why is code using iWarp calling ib_get_dma_mr with > >RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure. > > Because the iWARP protocol requires it, which was very much an > intentional decision. It actually is not insecure, as discussed > in detail in RFC5042. However, it is different from Infiniband. You'll have to point me to the section on that.. ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is defined to create a remote access MR to all of physical memory. That means there exists a packet the far side can send that will write to any place in physical memory. Yes, the far side has to guess the key, but there is no way you'll convince me that is secure, and 6.3.4 supports that position. "The ULP must set the base and bounds of the buffer when the STag is initialized to expose only the data to be retrieved." ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) directly contravenes that requirement. I assume you mean when it creates a FRMR it is secure that it uses IB_ACCESS_REMOTE_WRITE and the invalidates the MR before accessing it - and I agree with that. > >Why on earth is NFS using frmr to create RDMA READ lkeys? > > Because NFS desires to have a single code path that works for all > transports. In addition, using the IB dma_mr as an lkey means that > large I/O (common with NFS) would require multiple RDMA Read > operations, when the page list exceeded the local scatter/gather > limit of the device. So NFS got overwhelmed by API complexity and used a scheme that penalizes all hardware, in all cases, rather than fall back on MR only when absolutely necessary. Not really a ringing endorsement of status quo.... > >I think when you do that, it quickly becomes clear that iWarp's > >problem is not a seemingly minor issue with different flag bits, but > >that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer. > >Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work > >around. So iWarp (and only iWarp) should take the pain of spinning up > >temporary local MRs for RDMA READ. > > That is entirely the wrong layer to address this. It would prevent > reuse of MRs, and require upper layers to be aware that this was > the case - which is exactly the opposite of what you are trying > to achieve. How So? I'd hope for an API that is transparent to the upper layer, that lets the driver/core do a variety of things. If a call needs to create a temporary local MR, then the API should be structured so the driver/core can do that. If the MR type benefits from re-use then the core/driver should internally pool and re-use the MRs. Give me a reason why NFS should care about any of this? All it is doing is issuing RDMA READ's and expecting that data to land in local memory. Concretely, I'd imagine something like cookie = rdma_post_read(sendq,local_sg,remote_sg,wrid); [..] if (wc->wrid == wrid) rdma_complete_read(sendq,cookie); And the core/driver will RDMA READ the remote addresses in remote_sg list into the local_sg, using the best available strategy, and it doesn't have limits like only a few SG entries because 'best available strategy' includes using temporary MRs and multiple SQWEs. The driver will pick the 'best available strategy' that suits the HW it is driving, not NFS/iSER/SRP/Lustre. Same basic story for SEND, WRITE and RECV. > >This should all be hidden under a common API and it shouldn't be > >sprinkled all over the place in NFS, iSER, etc. > > Are you arguing that all upper layer storage protocols take a single > common approach to memory registration? That is a very different > discussion. I'm arguing upper layer protocols should never even see local memory registration, that it is totally irrelevant to them. So yes, you can call that a common approach to memory registration if you like.. Basically it appears there is nothing that NFS can do to optimize that process that cannot be done in the driver/core equally effectively and shared between all ULPs. If you see something otherwise, I'm really interested to hear about it. Even your case of the MR trade off for S/G list limitations - that is a performance point NFS has no buisness choosing. The driver is best placed to know when to switch between S/G lists, multiple RDMA READs and MR. The trade off will shift depending on HW limits: - Old mthca hardware is probably better to use multiple RDMA READ - mlx4 is probably better to use FRMR - mlx5 is likely best with indirect MR - All of the above are likely best to exhaust the S/G list first The same basic argument is true of WRITE, SEND and RECV. If the S/G list is exhausted then the API should transparently build a local MR to linearize the buffer, and the API should be designed so the core code can do that without the ULP having to be involved in those details. Is it possible? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote: > On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote: > > Then, what is left is all remote MRs and maybe it will be clearer what > > to do about them then... > > From looking at that for a while the APIs needed seem pretty simple > to me from a consumer perspective: > > struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages); > void rdma_free_mr(struct rdma_mr *mr); The major trouble with that is that the new MR types work by posting work to the send queue, that work creates the MR. I don't know all the details of how those schemes work, but it doesn't look like it fits into this model ? Maybe if rdma_register_sg was the only API and it accepted a QP, under the hood it could re-use a pooled/create a non-queued (F)MR, or issue FRMR, or work with indirect? It would be nice if it is that simple :) Sagi, are FMRs, FRMRs and Indirect MRs documented any place? I can't recall off hand if FRMR was ever published by the IBTA? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 08, 2015 at 06:03:37PM -0600, Jason Gunthorpe wrote: > The major trouble with that is that the new MR types work by posting > work to the send queue, that work creates the MR. > > I don't know all the details of how those schemes work, but it doesn't > look like it fits into this model ? It sorta does, but we do need a QP argument. I have implemented this abstraction in a driver, but it's still usign driver specific data structures at this point, so extracting all the arguments is a bit of a mess. But if there is some interest I can try to polish it up and port an existing driver or two to it. Note that post_send isn't really used for alloating the MR structure but performing the registration. > Maybe if rdma_register_sg was the only API and it accepted a QP, under > the hood it could re-use a pooled/create a non-queued (F)MR, or issue > FRMR, or work with indirect? I really want to avoid another allocator in my driver. With the blk-mq model we basically preallocate a request with all resoruces, so right now I totally avoid allocations in I/O path, which is something I'd like to keep if possible. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/8/2015 8:14 PM, Hefty, Sean wrote: >> I am still not clear if all of us agree that we need it. >> Sean and Steve had some disclaimers... > > A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used. It is true that different MRs will be used differently. However, not once we have found ourselves extending an API to add functionality. This means changing the API signature and changing all the call sites. Just recently we saw this in Steve's mr_roles and in Matan's timestamping support (changed ib_create_cq). When was the last time ib_create_qp API was modified? > We have a single entry point for post_send, for example, and that makes things worse. I don't necessarily agree. the fact that post_send have multiple entry points allows the consumer to concatenate a couple of those in a single post. That's beneficial to get maximum performance from your HW. > IMO, we need fewer registration *methods* not fewer calls. I do agree that we need fewer registration methods. Let's review what we have today: - FRWR: which is *the* standard way to register memory. It's fast, non-blocking and has vast support. - FMR: which is only supported in some Mellanox devices if I'm not mistaken, it's fast, but has slow invalidation (unmap). It is also not supported over VF. * FMR_POOL API was designed to address the slow unmap but created a week invalidation semantics (unmap is done only when some number of remapping is met). - REG_PHYS_MR: which is supported by some devices. It actually combines both MR allocation and registration in a single call (it is the equivalent of user-space ibv_reg_mr) I don't consider the dma_mr a registration method. It provides physical memory access. As for REG_PHYS_MR, this is the only synchronous registration method in the kernel. It is a simple interface, which is currently used by xprtrdma when dma mr is not supported. We can consider removing it in the future if it turns out to be non useful. As for FMR/FMR_POOL, I'd much rather to wait until it becomes deprecated on it's own rather than try and incorporate it in a modernized API. I think the stack can converge on FRWR as its primary registration method. Let's focus on making it better. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/8/2015 11:32 PM, 'Christoph Hellwig' wrote: > On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote: >> Then, what is left is all remote MRs and maybe it will be clearer what >> to do about them then... > > From looking at that for a while the APIs needed seem pretty simple > to me from a consumer perspective: > > struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages); Why assuming that the mr spans pages? I'd prefer if we consolidate on an existing mr allocation API ib_create_mr() that receives a generic struct ib_mr_init_attr. > void rdma_free_mr(struct rdma_mr *mr); > > /* updates *sg if the SG couldn't be fully registered due to offsets */ > int rdma_register_sg(struct rdma_mr *mr, struct scatterlist **sg, > u32 *pkey, u32 *offset, u32 *len); This is where I have a problem. Providing an API that may or may not post a work request on my QP is confusing, and I don't understand its semantics at all. Do I need to reserve slots on my QP? should I ask for a completion? If we suppress the completion will I see an error completion? What should I expect to find in the wr_id? The stack is converging on FRWR as the primary registration method. Any other methods will gradually be removed. I think that posting the work request must never be implicit. We should focus on making the registration work request better and hide the mechanics from the consumer. P.S. you probably didn't mean the argument pkey unless I'm completely missing something... > void rdma_unregister_sg(struct rdma_mr *mr, struct scatterlist *sg); Same arguments here for local invalidate work request. > Note that this assumes that the iSER bounce buffer hacks are replaced > with the QUEUE_FLAG_SG_GAPS flags and a change to the SG_IO code to > bounce buffer for vectored SG_IO calls. Yea - that's on my todo's... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/9/2015 3:03 AM, Jason Gunthorpe wrote: > On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote: >> On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote: >>> Then, what is left is all remote MRs and maybe it will be clearer what >>> to do about them then... >> >> From looking at that for a while the APIs needed seem pretty simple >> to me from a consumer perspective: >> >> struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages); >> void rdma_free_mr(struct rdma_mr *mr); > > The major trouble with that is that the new MR types work by posting > work to the send queue, that work creates the MR. > > I don't know all the details of how those schemes work, but it doesn't > look like it fits into this model ? > > Maybe if rdma_register_sg was the only API and it accepted a QP, under > the hood it could re-use a pooled/create a non-queued (F)MR, or issue > FRMR, or work with indirect? I think it's confusing to possibly do completely different registration methods which has different implications on the consumer RDMA channel. > > It would be nice if it is that simple :) It can be simple if we consolidate on a single method. > > Sagi, are FMRs, FRMRs and Indirect MRs documented any place? I can't > recall off hand if FRMR was ever published by the IBTA? I don't know if FMRs are in IBTA. FRMRs are in under 10.6.3.7 FAST REGISTRATION (vol 1) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/9/2015 2:36 AM, Jason Gunthorpe wrote: > > I'm arguing upper layer protocols should never even see local memory > registration, that it is totally irrelevant to them. So yes, you can > call that a common approach to memory registration if you like.. > > Basically it appears there is nothing that NFS can do to optimize that > process that cannot be done in the driver/core equally effectively and > shared between all ULPs. If you see something otherwise, I'm really > interested to hear about it. > > Even your case of the MR trade off for S/G list limitations - that is > a performance point NFS has no buisness choosing. The driver is best > placed to know when to switch between S/G lists, multiple RDMA READs > and MR. The trade off will shift depending on HW limits: > - Old mthca hardware is probably better to use multiple RDMA READ > - mlx4 is probably better to use FRMR > - mlx5 is likely best with indirect MR > - All of the above are likely best to exhaust the S/G list first > > The same basic argument is true of WRITE, SEND and RECV. If the S/G > list is exhausted then the API should transparently build a local MR > to linearize the buffer, and the API should be designed so the core > code can do that without the ULP having to be involved in those > details. > > Is it possible? > > Jason > Jason, We have protocol that involves remote memory keys transfer in their standards so I don't see how we can remove it altogether from ULPs. Putting that aside, My main problem with this approach is that once you do non-trivial things such as memory registration completely under the hood, it is a slippery slope for device drivers. If say a driver decides to register memory without the caller knowing, it would need to post an extra work request on the send queue. So once it sees the completion, it needs to silently consume it and have some non trivial logic to invalidate it (another work request!) either from poll_cq context or another thread. Moreover, this also means that the driver needs to allocate bigger send queues for possible future memory registration (depending on the IO pattern maybe). And I really don't like an API that instructs the user "please allocate some extra room in your send queue as I might need it". This would also require the drivers to take a huristic approach on how much memory registration resources are needed for all possible consumers (ipoib, sdp, srp, iser, nfs, more...) which might have different requirements. I know that these are implementation details, but the point is that vendor drivers can easily become a complete mess. I think we should try to find a balanced approach where both consumers and providers are not completely messed up. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Jul 9, 2015, at 4:46 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote: > On 7/8/2015 8:14 PM, Hefty, Sean wrote: >>> I am still not clear if all of us agree that we need it. >>> Sean and Steve had some disclaimers... >> >> A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used. > > It is true that different MRs will be used differently. However, not > once we have found ourselves extending an API to add functionality. This > means changing the API signature and changing all the call sites. Just > recently we saw this in Steve's mr_roles and in Matan's timestamping > support (changed ib_create_cq). When was the last time ib_create_qp API > was modified? > >> We have a single entry point for post_send, for example, and that makes things worse. > > I don't necessarily agree. the fact that post_send have multiple entry > points allows the consumer to concatenate a couple of those in a single > post. That's beneficial to get maximum performance from your HW. > >> IMO, we need fewer registration *methods* not fewer calls. > > I do agree that we need fewer registration methods. I also feel that would be a healthy direction. > Let's review what we have today: > > - FRWR: which is *the* standard way to register memory. It's fast, > non-blocking and has vast support. > > - FMR: which is only supported in some Mellanox devices if I'm not > mistaken, it's fast, but has slow invalidation (unmap). It is also not > supported over VF. > * FMR_POOL API was designed to address the slow unmap but created a > week invalidation semantics (unmap is done only when some number of > remapping is met). > > - REG_PHYS_MR: which is supported by some devices. It actually > combines both MR allocation and registration in a single call (it is > the equivalent of user-space ibv_reg_mr) There is also Memory Windows, but there may no longer be any kernel consumers of that memory registration method. > I don't consider the dma_mr a registration method. It provides physical > memory access. > > As for REG_PHYS_MR, this is the only synchronous registration method in > the kernel. It is a simple interface, which is currently used by xprtrdma when dma mr is not supported. We can consider removing it in > the future if it turns out to be non useful. Code review has shown the remaining ib_reg_phys_mr() call in xprtrdma is never reached in the current code, and thus it will be removed very soon. There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre. > As for FMR/FMR_POOL, I'd much rather to wait until it becomes > deprecated on it's own rather than try and incorporate it in a > modernized API. > I think the stack can converge on FRWR as its primary registration > method. Let's focus on making it better. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 09, 2015 at 02:02:03PM +0300, Sagi Grimberg wrote: > We have protocol that involves remote memory keys transfer in their > standards so I don't see how we can remove it altogether from ULPs. This is why I've been talking about local and remote MRs differently. A Local MR is one where the Key is never put on the wire, it exists soley to facilitate DMA between the CPU and the local HCA, and it would never be needed if we had infinitely long S/G lists. > My main problem with this approach is that once you do non-trivial > things such as memory registration completely under the hood, it is > a slippery slope for device drivers. Yes, there is going to be some stuff going on, but the simplification for the ULP side is incredible, it is certainly something that should be explored and not dismissed without some really good reasons. > If say a driver decides to register memory without the caller knowing, > it would need to post an extra work request on the send queue. Yes, the first issue is how to do flow control the sendq. But this seems easily solved, every ULP is already tracking the number of available entries in the senq, and it will not start new ops until there is space, so instead of doing the computation internally on how much space is needed to do X, we factor it out: if (rdma_sqes_post_read(...) < avail_sqe) avail_sqe -= rdma_post_read(...) else // Try again after completions advance Every new-style post op is paired with a 'how many entires do I need' call. This is not a new concept, a ULP working with FRMR already has to know it cannot start a FRMR using OP unless there is 2 SQE's available. (and it has to make all this conditional on if it using FRMR or something else). All this is doing is shifting the computation of '2' out of the ULP and into the driver. > So once it sees the completion, it needs to silently consume it and > have some non trivial logic to invalidate it (another work request!) > either from poll_cq context or another thread. Completions are driven by the ULP. Every new-style post also has a completion entry point. The ULP calls it when it knows the op has done, either because the WRID it provided has signaled completed, or because a later op has completed (signalling was supressed). Since that may also be an implicitly posting API (if necessary, is it?), it follows the same rules as above. This isn't changing anything. ULPs would already have to drive invalidate posts from completion with flow control, we are just moving the actul post construction and computation of the needed SQEs out of the ULP. > This would also require the drivers to take a huristic approach on how > much memory registration resources are needed for all possible > consumers (ipoib, sdp, srp, iser, nfs, more...) which might have > different requirements. That doesn't seem like a big issue. The ULP can give a hint on the PD or QP what sort of usage it expects. 'Up to 16 RDMA READS' 'Up to 1MB transfer per RDMA' and the core can use a pre-allocated pool scheme. I was thinking about a pre-allocation for local here, as Christoph suggests, I think that is a refinement we could certainly add on, once there is some clear idea what allocations are acutally necessary to spin up a temp MR. The basic issue I'd see is that the preallocation would be done without knowledge of the desired SG list, but maybe some kind of canonical 'max' SG could be used as a stand in... Put together, it would look like this: if (rdma_sqes_post_read(...) < avail_sqe) avail_sqe -= rdma_post_read(qp,...,read_wrid) [.. fetch wcs ...] if (wc.wrid == read_wrid) if (rdma_sqes_post_complete_read(...,read_wrid) < avail_sqe) rdma_post_complete_read(qp,...,read_wrid); else // queue read_wrid for later rdma_post_complete_read I'm not really seeing anything here that screams out this is impossible, or performance is impacted, or it is too messy on either the ULP or driver side. Laid out like this, I think it even means we can nuke the IB DMA API for these cases. rdma_post_read and rdma_post_complete_read are the two points that need dma api calls (cache flushes), and they can just do them internally. This also tells me that the above call sites must already exist in every ULP, so we, again, are not really substantially changing core control flow for the ULP. Are there more details that wreck the world? Just to break it down: - rdma_sqes_post_read figures out how many SQEs are needed to post the specified RDMA READ. On IB, if the SG list can be used then this is always 1. If the RDMA READ is split into N RDMA READS then it is N. For something like iWarp this would be (?) * FRMR SQE * RDMA READ SQE * FRMR Invalidate (signaled) Presumably we can squeeze FMR and so forth into this scheme as well? They don't seem to use SQE's so it is looks simpler.. Perhaps if an internal MR pool is exhausted this returns 0xFFFF and the caller will do a completion cycle, which may provide free MR's back to the pool. Ultimately once the SQ and CQ are totally drained the pool should be back to 100%? - rdma_post_read generates the necessary number of posts. The SQ must have the right number of entires available (see above) - rdma_post_complete_read is doing any clean up posts to make a MR ready to go again. Perhaps this isn't even posting? Semantically, I'd want to see rdma_post_complete_read returning to mean that the local read buffer is ready to go, and the ULP can start using it instantly. All invalidation is complete and all CPU caches are sync'd. This is where we'd start the recycling process for any temp MR, whatever that means.. I expect all these calls would be function pointers, and each driver would provide a function pointer that is optimal for it's use. Eg mlx4 would provide a pointer that used the S/G list, then falls back to FRMR if the S/G list is exhausted. The core code would provide a toolbox of common functions the drivers can use here. I didn't explore how errors work, but, I think, errors are just a labeling exercise: if (wc is error && wc.wrid == read_wrid rdma_error_complete_read(...,read_wrid,wc) Error recovery blows up the QP, so we just need to book keep and get the MRs accounted for. The driver could do a synchronous clean up of whatever mess is left during the next create_qp, or on the PD destroy. > I know that these are implementation details, but the point is that > vendor drivers can easily become a complete mess. I think we should > try to find a balanced approach where both consumers and providers are > not completely messed up. Sure, but today vendor drivers and the core is trivial while ULPs are an absolute mess. Goal #1 should be to move all the mess into the API and support all the existing methods. We should try as hard as possible to do that, and if along the way, it is just isn't possible, then fine. But that should be the first thing we try to reach for. Just tidying FRMR so it unifies with indirect, is a fine consolation prize, but I believe we can do better. To your point in another message, I'd say, as long as the new API supports FRMR at full speed with no performance penalty we are good. If the other variants out there take a performance hit, then I think that is OK. As you say, they are on the way out, we just need to make sure that ULPs continue to work with FMR with the new API so legacy cards don't completely break. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/9/2015 1:01 PM, Jason Gunthorpe wrote: > Laid out like this, I think it even means we can nuke the IB DMA API > for these cases. rdma_post_read and rdma_post_complete_read are the > two points that need dma api calls (cache flushes), and they can just > do them internally. > > This also tells me that the above call sites must already exist in > every ULP, so we, again, are not really substantially changing > core control flow for the ULP. > > Are there more details that wreck the world? Two things come to mind - PD's, and virtualization. If there's no ib_get_dma_mr() call, what PD does the region get? One could argue it inherits the QP's (Emulex proposed such a PD-less MR in this year's OFS Developer's Workshop). But this could impose new conditions on ULP's; they would have to be aware of this affinity and it could affect their QP use. More importantly, if a guest can post FRWR work requests with physical addresses, what enforces their validity? The dma_mr provides a PD but it also allows the host partition to interpose on the call, setting up an IOMMU mapping, creating a new NIC TPT mapping, etc. Without this, it may be possible for hostile guest to forge FRMR's and attack the host, or other guests. > I didn't explore how errors work, but, I think, errors are just a > labeling exercise: > if (wc is error && wc.wrid == read_wrid > rdma_error_complete_read(...,read_wrid,wc) > > Error recovery blows up the QP, so we just need to book keep and get > the MRs accounted for. The driver could do a synchronous clean up of > whatever mess is left during the next create_qp, or on the PD destroy. This is a subtle area. If the driver posts silenced completions as you describe, there may not be a wc to reap. So either the driver or the ULP will need to post a sentinel, the completion of which indicates any prior silenced operations have actually done so. This can be hard to get right. And if the driver posts everything signaled, well, performance at high IOPS will be a challenge. The ULP is much better positioned to manage that. I'm with you on the flow control, btw. It's a new rule for the ULP to obey, but probably not too onerous. Remember though, verbs today return EAGAIN when the queue you're posting is full (a terrible choice IMO). So upper layers don't actually need to count WR's, unless they want to. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 09, 2015 at 04:00:37PM -0400, Tom Talpey wrote: > On 7/9/2015 1:01 PM, Jason Gunthorpe wrote: > >Laid out like this, I think it even means we can nuke the IB DMA API > >for these cases. rdma_post_read and rdma_post_complete_read are the > >two points that need dma api calls (cache flushes), and they can just > >do them internally. > > > >This also tells me that the above call sites must already exist in > >every ULP, so we, again, are not really substantially changing > >core control flow for the ULP. > > > >Are there more details that wreck the world? > > Two things come to mind - PD's, and virtualization. > > If there's no ib_get_dma_mr() call, what PD does the region get? > One could argue it inherits the QP's (Emulex proposed such a > PD-less MR in this year's OFS Developer's Workshop). But this > could impose new conditions on ULP's; they would have to be > aware of this affinity and it could affect their QP use. You'll have to educate me here. Why does the in-kernel ULP care about what PD is used for lkey MRs? This is a real question - every ULP in kernel seems to create a single PD per device and thats it. (though NFS creates two, I assume that is some odd internal split, not actually functional) So, if there is a reason to have multiple PD's, nobody has done anything with it in the last decade? Thus, I think, for use on a lkey MR, it doesn't matter what PD is used, as long as the adaptor will execute the request. If the driver needs a PD, then using the one from the QP seems like it would cause no problems. > More importantly, if a guest can post FRWR work requests with > physical addresses, what enforces their validity? The dma_mr > provides a PD but it also allows the host partition to interpose > on the call, setting up an IOMMU mapping, creating a new NIC TPT > mapping, etc. Without this, it may be possible for hostile guest > to forge FRMR's and attack the host, or other guests. The MR concept doesn't go away, it just stops being exposed to the ULP. Every PD would implictly create a ib_get_dma_mr that can handle all local access. This isn't even a change, every ULP that creates a PD also immediately creates a ib_get_dma_mr, I'm just moving that into the alloc_pd call so the ULP never sees it. FRMR doesn't change, other than the WR is created in the core/driver layer, not the ULP. Whatever is put in that WR isn't going to change, if it is secure today, then it will still be secure tomorrow. Temporary MRs mean that rmda_post_read needs to be called in a context where it can hypercall to create one, if the driver needs. I don't think this is any different from the requirements the DMA API has? So nothing changes. Not exactly sure what ULPs do here, but I'm expecting they do this call before posting. Do any thread it? Sagi's comment on indirect registrations seems like future virtualization schemes will use a SQE post like FRMR to do this, but I don't know the details. So, nothing really seems to change here. What am I missing? > > I didn't explore how errors work, but, I think, errors are just a > > labeling exercise: > > if (wc is error && wc.wrid == read_wrid > > rdma_error_complete_read(...,read_wrid,wc) > > > > Error recovery blows up the QP, so we just need to book keep and get > > the MRs accounted for. The driver could do a synchronous clean up of > > whatever mess is left during the next create_qp, or on the PD destroy. > > This is a subtle area. If the driver posts silenced completions as > you describe, there may not be a wc to reap. So either the driver or > the ULP will need to post a sentinel, the completion of which indicates > any prior silenced operations have actually done so. This can be > hard to get right. And if the driver posts everything signaled, well, > performance at high IOPS will be a challenge. The ULP is much better > positioned to manage that. But again, nothing really changes. The ULP calls rdma_post_read, and like every other ib_post_send type thing, it will ask for signaled completion or not. If yes, then the last WR in the chain gets marked, otherwise none are marked. No different than what a ULP would do today when it builds WRs manually. The trick is, every call to rdma_post_read *MUST* be paired with a call to either rdma_error_complete_read, or rdma_post_complete_read with the wrid from above. If the ULP is using supressed completion, then it must bookkeep this stuff, and if it sees a completion for WRID+10, it must go through and call rdma_post_complete_read for WRID+0,+1,+2, ... Error unwind is similar.. > I'm with you on the flow control, btw. It's a new rule for the > ULP to obey, but probably not too onerous. Remember though, verbs > today return EAGAIN when the queue you're posting is full (a > terrible choice IMO). So upper layers don't actually need to > count WR's, unless they want to. Yes, EAGAIN is ugly, but we cannot trivially support that at this proposed API level. The issue is idempotency, if the core/driver needs to push 3 SGEs on, and only 2 make it, then WTF does the ULP do? To support EAGAIN, the core/drivers needs to support an all or nothing ib_post_send, which is beyond the driver capability we have today... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/08/2015 08:03 PM, Jason Gunthorpe wrote: > On Wed, Jul 08, 2015 at 01:32:05PM -0700, 'Christoph Hellwig' wrote: >> On Wed, Jul 08, 2015 at 01:08:42PM -0600, Jason Gunthorpe wrote: >>> Then, what is left is all remote MRs and maybe it will be clearer what >>> to do about them then... >> >> From looking at that for a while the APIs needed seem pretty simple >> to me from a consumer perspective: >> >> struct rdma_mr *rmda_alloc_mr(struct ib_pd *pd, unsigned int nr_pages); >> void rdma_free_mr(struct rdma_mr *mr); > > The major trouble with that is that the new MR types work by posting > work to the send queue, that work creates the MR. > > I don't know all the details of how those schemes work, but it doesn't > look like it fits into this model ? > > Maybe if rdma_register_sg was the only API and it accepted a QP, under > the hood it could re-use a pooled/create a non-queued (F)MR, or issue > FRMR, or work with indirect? > > It would be nice if it is that simple :) > > Sagi, are FMRs, FRMRs and Indirect MRs documented any place? I can't > recall off hand if FRMR was ever published by the IBTA? A lot of this discussion is interesting and has gone off in an area that I think is worth pursuing in the long run. However, in the short run, this patch was a minor cleanup/simplification patch. These discussions are moving into complete redesigns and rearchitecting. Steve, I'm OK with the cleanup and would prefer that it stay separate from these larger issues. Baby steps, one things at a time.
On Thu, Jul 09, 2015 at 06:18:26PM -0400, Doug Ledford wrote: > A lot of this discussion is interesting and has gone off in an area that > I think is worth pursuing in the long run. However, in the short run, > this patch was a minor cleanup/simplification patch. These discussions > are moving into complete redesigns and rearchitecting. Steve, I'm OK > with the cleanup and would prefer that it stay separate from these > larger issues. So, I was originally of the view the flags change was fine. But when I realized that it basically hides a ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) behind an opaque API: rdma_get_dma_mr(RDMA_MRR_READ_DEST) I changed my mind. For iWarp the above creates a rkey that can RDMA WRITE to any place in system memory. If a remote guesses that rkey your system is totally compromised. So it is insecure, and contrary to the advice in RFC5042. Seeing rdma_get_dma_mr(RDMA_MRR_READ_DEST) added all over the code base terrifies me, because it is utterly wrong, and looks harmless. The mistep, is that enabling iSER for iWarp is not just this trivial change: - device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE); + mr_roles = RDMA_MRR_RECV | RDMA_MRR_SEND | RDMA_MRR_WRITE_SOURCE | + RDMA_MRR_READ_DEST; + device->mr = rdma_get_dma_mr(device->pd, mr_roles, 0); But, it must also follow the path of NFS and use FRMR on iWarp for RDMA READ lkeys. This is the real quirk of iWarp, not that the MR flags are slightly different. From there, it is a logical wish: If Steve is going to FRMR'ize iSER to be acceptable security wise, I'd rather see that work done on in a general way. Hence the API discussion. What do you think? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/9/2015 8:01 PM, Jason Gunthorpe wrote: > On Thu, Jul 09, 2015 at 02:02:03PM +0300, Sagi Grimberg wrote: > >> We have protocol that involves remote memory keys transfer in their >> standards so I don't see how we can remove it altogether from ULPs. > > This is why I've been talking about local and remote MRs > differently. IMHO, memory registration is memory registration. The fact that we are distinguishing between local and remote might be a sign that this might be a wrong direction to take. Sorry. Besides, if a ULP wants to register memory for local access why should we temper with that or deny it? What if a ULP has a pre-allocated pool of large buffers that it knows it is going to use for its entire lifetime? silent driver driven FRWRs would perform a lot worse than pre-registering these buffers. Or what if the ULP wants to register the memory region with data integrity (signature) parameters? I must say, whenever I find myself trying to assume/guess what the ULP/APP might do from the driver PoV and try to see if I'm covered, I shake my head saying: "This is a hack, go drink some water and rethink the whole thing". If there is one thing worse than a complicated API, it is a restrictive one. I'd much rather ULPs just having a simple API for registering memory. > A Local MR is one where the Key is never put on the wire, > it exists soley to facilitate DMA between the CPU and the local HCA, > and it would never be needed if we had infinitely long S/G lists. > >> My main problem with this approach is that once you do non-trivial >> things such as memory registration completely under the hood, it is >> a slippery slope for device drivers. > > Yes, there is going to be some stuff going on, but the simplification > for the ULP side is incredible, it is certainly something that should > be explored and not dismissed without some really good reasons. > >> If say a driver decides to register memory without the caller knowing, >> it would need to post an extra work request on the send queue. > > Yes, the first issue is how to do flow control the sendq. > > But this seems easily solved, every ULP is already tracking the > number of available entries in the senq, and it will not start new ops > until there is space, so instead of doing the computation internally > on how much space is needed to do X, we factor it out: > > if (rdma_sqes_post_read(...) < avail_sqe) > avail_sqe -= rdma_post_read(...) > else > // Try again after completions advance > > Every new-style post op is paired with a 'how many entires do I need' > call. > > This is not a new concept, a ULP working with FRMR already has to know > it cannot start a FRMR using OP unless there is 2 SQE's > available. (and it has to make all this conditional on if it using > FRMR or something else). All this is doing is shifting the computation > of '2' out of the ULP and into the driver. > >> So once it sees the completion, it needs to silently consume it and >> have some non trivial logic to invalidate it (another work request!) >> either from poll_cq context or another thread. > > Completions are driven by the ULP. Every new-style post also has a > completion entry point. The ULP calls it when it knows the op has > done, either because the WRID it provided has signaled completed, or > because a later op has completed (signalling was supressed). > > Since that may also be an implicitly posting API (if necessary, is > it?), it follows the same rules as above. This isn't changing > anything. ULPs would already have to drive invalidate posts from > completion with flow control, we are just moving the actul post > construction and computation of the needed SQEs out of the ULP. > >> This would also require the drivers to take a huristic approach on how >> much memory registration resources are needed for all possible >> consumers (ipoib, sdp, srp, iser, nfs, more...) which might have >> different requirements. > > That doesn't seem like a big issue. The ULP can give a hint on the PD > or QP what sort of usage it expects. 'Up to 16 RDMA READS' 'Up to 1MB > transfer per RDMA' and the core can use a pre-allocated pool scheme. > > I was thinking about a pre-allocation for local here, as Christoph > suggests, I think that is a refinement we could certainly add on, once > there is some clear idea what allocations are acutally necessary to > spin up a temp MR. The basic issue I'd see is that the preallocation > would be done without knowledge of the desired SG list, but maybe some > kind of canonical 'max' SG could be used as a stand in... > > Put together, it would look like this: > if (rdma_sqes_post_read(...) < avail_sqe) > avail_sqe -= rdma_post_read(qp,...,read_wrid) > [.. fetch wcs ...] > if (wc.wrid == read_wrid) > if (rdma_sqes_post_complete_read(...,read_wrid) < avail_sqe) > rdma_post_complete_read(qp,...,read_wrid); > else > // queue read_wrid for later rdma_post_complete_read > > I'm not really seeing anything here that screams out this is > impossible, or performance is impacted, or it is too messy on either > the ULP or driver side. I think it is possible (at the moment). But I don't know if we should have the drivers abusing the send/completion queues like that. I can't say I'm fully on board with the idea of silent send-queue posting and silent completion consuming. > > Laid out like this, I think it even means we can nuke the IB DMA API > for these cases. rdma_post_read and rdma_post_complete_read are the > two points that need dma api calls (cache flushes), and they can just > do them internally. > > This also tells me that the above call sites must already exist in > every ULP, so we, again, are not really substantially changing > core control flow for the ULP. > > Are there more details that wreck the world? > > Just to break it down: > - rdma_sqes_post_read figures out how many SQEs are needed to post > the specified RDMA READ. > On IB, if the SG list can be used then this is always 1. > If the RDMA READ is split into N RDMA READS then it is N. > For something like iWarp this would be (?) > * FRMR SQE > * RDMA READ SQE > * FRMR Invalidate (signaled) > > Presumably we can squeeze FMR and so forth into this scheme as > well? They don't seem to use SQE's so it is looks simpler.. > > Perhaps if an internal MR pool is exhausted this returns 0xFFFF > and the caller will do a completion cycle, which may provide > free MR's back to the pool. Ultimately once the SQ and CQ are > totally drained the pool should be back to 100%? > - rdma_post_read generates the necessary number of posts. > The SQ must have the right number of entires available > (see above) > - rdma_post_complete_read is doing any clean up posts to make a MR > ready to go again. Perhaps this isn't even posting? > > Semantically, I'd want to see rdma_post_complete_read returning to > mean that the local read buffer is ready to go, and the ULP can > start using it instantly. All invalidation is complete and all > CPU caches are sync'd. > > This is where we'd start the recycling process for any temp MR, > whatever that means.. > > I expect all these calls would be function pointers, and each driver > would provide a function pointer that is optimal for it's use. Eg mlx4 > would provide a pointer that used the S/G list, then falls back to > FRMR if the S/G list is exhausted. The core code would provide a > toolbox of common functions the drivers can use here. Maybe it's just me, but I can't help but wander if this is facilitating an atmosphere where drivers will keep finding new ways to abuse even the most simple operations. I need more time to comprehend. > > I didn't explore how errors work, but, I think, errors are just a > labeling exercise: > if (wc is error && wc.wrid == read_wrid > rdma_error_complete_read(...,read_wrid,wc) > > Error recovery blows up the QP, so we just need to book keep and get > the MRs accounted for. The driver could do a synchronous clean up of > whatever mess is left during the next create_qp, or on the PD destroy. > >> I know that these are implementation details, but the point is that >> vendor drivers can easily become a complete mess. I think we should >> try to find a balanced approach where both consumers and providers are >> not completely messed up. > > Sure, but today vendor drivers and the core is trivial while ULPs are > an absolute mess. > > Goal #1 should be to move all the mess into the API and support all > the existing methods. We should try as hard as possible to do that, > and if along the way, it is just isn't possible, then fine. But that > should be the first thing we try to reach for. > > Just tidying FRMR so it unifies with indirect, is a fine consolation > prize, but I believe we can do better. > > To your point in another message, I'd say, as long as the new API > supports FRMR at full speed with no performance penalty we are > good. If the other variants out there take a performance hit, then I > think that is OK. As you say, they are on the way out, we just need to > make sure that ULPs continue to work with FMR with the new API so > legacy cards don't completely break. My intention is to improve FRWR API and gradually remove the other APIs from the kernel (i.e. FMR/FMR_POOL/MW). As I said, I don't think that striving to an API that implicitly chooses how to register memory is a good idea. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/9/2015 6:53 PM, Jason Gunthorpe wrote: > On Thu, Jul 09, 2015 at 06:18:26PM -0400, Doug Ledford wrote: > >> A lot of this discussion is interesting and has gone off in an area that >> I think is worth pursuing in the long run. However, in the short run, >> this patch was a minor cleanup/simplification patch. These discussions >> are moving into complete redesigns and rearchitecting. Steve, I'm OK >> with the cleanup and would prefer that it stay separate from these >> larger issues. > > So, I was originally of the view the flags change was fine. > > But when I realized that it basically hides a > ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) behind an opaque API: > > rdma_get_dma_mr(RDMA_MRR_READ_DEST) > > I changed my mind. Yes, I agree. Creating a writable privileged rmr for the RDMA Read sink buffer exposes all of physical memory to a simple arithmetic mistake. This should never be allowed in a production config. That said, for the use by the NFS/RDMA Server, the risk is significantly mitigated. The sink MR is never sent to the peer by the NFS upper layer, it is protected by the connection's PD, and it is enabled only when the RDMA Read is active. However, the very real risk remains. This should definitely not be the standard API. > For iWarp the above creates a rkey that can RDMA WRITE to any place in > system memory. If a remote guesses that rkey your system is totally > compromised. So it is insecure, and contrary to the advice in RFC5042. > > Seeing rdma_get_dma_mr(RDMA_MRR_READ_DEST) added all over the code > base terrifies me, because it is utterly wrong, and looks harmless. > > The mistep, is that enabling iSER for iWarp is not just this trivial > change: > > - device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE); > + mr_roles = RDMA_MRR_RECV | RDMA_MRR_SEND | RDMA_MRR_WRITE_SOURCE | > + RDMA_MRR_READ_DEST; > + device->mr = rdma_get_dma_mr(device->pd, mr_roles, 0); > Yep. It's the only safe, correct solution. > But, it must also follow the path of NFS and use FRMR on iWarp for > RDMA READ lkeys. This is the real quirk of iWarp, not that the MR > flags are slightly different. > > From there, it is a logical wish: If Steve is going to FRMR'ize iSER > to be acceptable security wise, I'd rather see that work done on in a > general way. Hence the API discussion. Well, it's important to realize that NFS already does the Right Thing. So it isn't actually an urgent issue. There is time to discuss. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: > and it is enabled only when the RDMA Read is active. ??? How is that done? ib_get_dma_mr is defined to return a remote usable rkey that is valid from when it it returns until the MR is destroyed. NFS creates one mr early on, it does not seem to make one per RDMA READ? For the proposed iSER patch the problem is very acute, iser makes a single PD and phys MR at boot time for each device. This means there is a single machine wide unchanging rkey that allows remote physical memory access. An attacker only has to repeatedly open QPs to iSER and guess rkey values until they find it. Add in likely non-crypto randomness for rkeys, and I bet it isn't even that hard to do. So this seems to be a catastrophic security failing. > > From there, it is a logical wish: If Steve is going to FRMR'ize iSER > >to be acceptable security wise, I'd rather see that work done on in a > >general way. Hence the API discussion. > > Well, it's important to realize that NFS already does the Right Thing. > So it isn't actually an urgent issue. There is time to discuss. It depends on what is going on with iWarp iSER. If the iWarp community thinks it should go ahead insecurely then fine, put a giant warning in dmesg and go ahead, otherwise iWarp needs to address this difference with a proper generic API, which appears, must wrapper ib_post_send. Harder job :\ I'm sorry Steve for leading you down the wrong path with these flags, I did not fully realize exactly what the iWarp difference was at the start :( Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 10, 2015 at 11:55:29AM +0300, Sagi Grimberg wrote: > IMHO, memory registration is memory registration. The fact that we are > distinguishing between local and remote might be a sign that this might > be a wrong direction to take. Sorry. I belive they are very different, yes, if you sit at the level of the specification or maybe even a HCA design, they are nice and similar, but for someone writing a ULP - local is not helping. > What if a ULP has a pre-allocated pool of large buffers that it knows > it is going to use for its entire lifetime? silent driver driven FRWRs > would perform a lot worse than pre-registering these buffers. Okay, so I think you've gone too far down the path here. I am proposing an API direction that hides the lkey entirely and provides common posting path for ULPs that needs dynamic short term on the fly MR creation. This is basically every ULP in the kernel today. I'm not saying we should rip out all of the lkey stuff and hobble ib_post_send, only that this new API, aimed *specifically* at simplifying our *current* universe of ULPs should do that. > Or what if the ULP wants to register the memory region with data > integrity (signature) parameters? Then it calls rdma_post_write_dif(..)/etc. Is that not good enough? > If there is one thing worse than a complicated API, it is a restrictive > one. I'd much rather ULPs just having a simple API for registering > memory. No, strongly disagree. A restrictive API that solves exactly the problem our ULPs today face is *exactly* what we need here. This is in-kernel. It isn't a UAPI. It isn't an industry standard. We can change and revise it next year if we need. > >I'm not really seeing anything here that screams out this is > >impossible, or performance is impacted, or it is too messy on either > >the ULP or driver side. > > I think it is possible (at the moment). But I don't know if we should > have the drivers abusing the send/completion queues like that. > > I can't say I'm fully on board with the idea of silent send-queue > posting and silent completion consuming. I'm not sure I'd call it silent, it tells the ULP how many slots it will use. > >I expect all these calls would be function pointers, and each driver > >would provide a function pointer that is optimal for it's use. Eg mlx4 > >would provide a pointer that used the S/G list, then falls back to > >FRMR if the S/G list is exhausted. The core code would provide a > >toolbox of common functions the drivers can use here. > > Maybe it's just me, but I can't help but wander if this is facilitating > an atmosphere where drivers will keep finding new ways to abuse even > the most simple operations. Maybe, I'm not sure. Being restrictive with the API certainly prevents alot of 'creative' uses. It is hard to argue about what rdma_post_rdma_read should do, it can be made quite narrowly defined. If mlx4 implements this with a FRMR call and mlx5 uses Indirect MR, and qib implements it with a non-standard extended scatter list - do I care? Not really. But it does seem provide a much saner way for vendors to add extensions, ie I think I'd rather see a rdma_post_write_dif than a bunch of non-standard extensions in FRMR flags and WR attributes. > I need more time to comprehend. Please think about it, I'm pretty sure the iWarp guys *have* to go down this road, it is a good way for them to implement their quirk on RDMA READ across many ULPs. Understand the iWarp problem is that they cannot use a phys dma MR for their RDMA READ lkey - this is a major difference from IB. The issue is larger than just memory registration. > My intention is to improve FRWR API and gradually remove the other APIs > from the kernel (i.e. FMR/FMR_POOL/MW). As I said, I don't think that > striving to an API that implicitly chooses how to register memory is a > good idea. Can you explain why? And I mean specifically - how will NFS/ISER/SRP/Lustre specifically be impacted if we move that choice into the core/driver layer? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/10/2015 12:11 PM, Jason Gunthorpe wrote: > On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: >> and it is enabled only when the RDMA Read is active. > > ??? How is that done? ib_get_dma_mr is defined to return a remote > usable rkey that is valid from when it it returns until the MR is > destroyed. NFS creates one mr early on, it does not seem to make one > per RDMA READ? > > For the proposed iSER patch the problem is very acute, iser makes a > single PD and phys MR at boot time for each device. This means there > is a single machine wide unchanging rkey that allows remote physical > memory access. An attacker only has to repeatedly open QPs to iSER and > guess rkey values until they find it. Add in likely non-crypto > randomness for rkeys, and I bet it isn't even that hard to do. > > So this seems to be a catastrophic security failing. As I recall, didn't the NFSoRDMA stack do even worse in the beginning (meaning it previously had a memory model that was simply "register all memory on the client and pass the rkey to the server and then tell the server when/where to read/write", which was subsequently removed in Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)? I'm not bringing this up to downplay the current iSER issue, but to demonstrate that we have long had a different trust model than the one in RFC5042. More below on that. >>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER >>> to be acceptable security wise, I'd rather see that work done on in a >>> general way. Hence the API discussion. >> >> Well, it's important to realize that NFS already does the Right Thing. >> So it isn't actually an urgent issue. There is time to discuss. > > It depends on what is going on with iWarp iSER. If the iWarp community > thinks it should go ahead insecurely then fine, put a giant warning in > dmesg and go ahead, otherwise iWarp needs to address this difference > with a proper generic API, which appears, must wrapper > ib_post_send. Harder job :\ > > I'm sorry Steve for leading you down the wrong path with these flags, > I did not fully realize exactly what the iWarp difference was at the > start :( Are there security issues? Yes. Are we going to solve them in this patch set? No. Especially since those security issues extend beyond iSER + iWARP. There are currently 4 RDMA capable link types/protocols. Of those 4, 3 enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot be sent out into the wild internet). RFC5042 is written with iWARP in mind because it *can* be sent out over the Internet and therefore it is possible for someone to place an RNIC card directly on the Internet and it must therefore deal with all of the attack vectors this entails. The linux RDMA stack, as a whole, is no where near RFC5042 compliant. And it isn't just the kernel space at fault here. Early versions of opensm used to require that the master and slave machines must have passwordless root ssh access to each other to copy around the guid2lid file so when the master failed over, the slave would have an up to date list. Because 3 of the 4 RDMA technology types enforce localized clusters, the majority of the stack and the implementations have utilized the cluster as the point of trust. In truth, I think it's simply been the easy way to do things. RFC5042 is very clearly a non-trust model where each machine, and each connection, is assumed to be a threat. So, long story short(ish): While this discussion is taking place in the cleanup thread, part of it involves the preceding patch set: iSER support for iWARP. Patch 4/5 in the iSER/iWARP series enables the remote write on device->mr registration which constitutes the security issue we've been discussing. This cleanup patch set, then, is merely guilty of hiding that security issue, not introducing it. I think Steve and the other iSER folks should consider the wisdom of including the iWARP support as it stands. If they think that the common use case is in a sequestered cluster, and that the current trust domain is already not RFC5042 compliant, then I'm OK with enabling iSER support for iWARP, but I would want there to be a very clear warning in the dmesg about the trust domain issue. I might even suggest that the iSER driver, at least on iWARP connections, could possibly check the src/dst addresses to make sure that they are in a private IP domain and if they aren't, refuse to create a connection while emitting an additional warning about the trust domain. As far as the cleanup patches are concerned, I'm OK with those too. However, since we would have added a large warning in the iSER code about this issue, and that would then go away with the cleanup, the warning would need to be generalized (such as "module <foo> is requesting insecure iWARP mappings"), and appropriately WARN_ONCEd on probably a per module basis or something like that. From there, we can start to look at the bigger picture of cleanup up the default trust domain in the kernel and user space both (and soliciting feedback on that...I have a suspicion that some users will not like us tightening up security as it might interfere with their usage in their sequestered clusters).
On Jul 10, 2015, at 1:56 PM, Doug Ledford <dledford@redhat.com> wrote: > On 07/10/2015 12:11 PM, Jason Gunthorpe wrote: >> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: >>> and it is enabled only when the RDMA Read is active. >> >> ??? How is that done? ib_get_dma_mr is defined to return a remote >> usable rkey that is valid from when it it returns until the MR is >> destroyed. NFS creates one mr early on, it does not seem to make one >> per RDMA READ? >> >> For the proposed iSER patch the problem is very acute, iser makes a >> single PD and phys MR at boot time for each device. This means there >> is a single machine wide unchanging rkey that allows remote physical >> memory access. An attacker only has to repeatedly open QPs to iSER and >> guess rkey values until they find it. Add in likely non-crypto >> randomness for rkeys, and I bet it isn't even that hard to do. >> >> So this seems to be a catastrophic security failing. > > As I recall, didn't the NFSoRDMA stack do even worse in the beginning > (meaning it previously had a memory model that was simply "register all > memory on the client and pass the rkey to the server and then tell the > server when/where to read/write", which was subsequently removed in > Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)? This registration strategy was not removed. It is available and used when the underlying HCA does not support either FRWR or FMR, or it can be selected via sysctl. I chose to leave it in for testing purposes. I know Christoph would like to see it removed because of the security issues you mention above. Others have asked that it be retained. Certainly I can remove it as a fallback choice, leaving it so that it can only be selected by sysctl. > I'm not bringing this up to downplay the current iSER issue, but to > demonstrate that we have long had a different trust model than the one > in RFC5042. More below on that. > >>>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER >>>> to be acceptable security wise, I'd rather see that work done on in a >>>> general way. Hence the API discussion. >>> >>> Well, it's important to realize that NFS already does the Right Thing. >>> So it isn't actually an urgent issue. There is time to discuss. >> >> It depends on what is going on with iWarp iSER. If the iWarp community >> thinks it should go ahead insecurely then fine, put a giant warning in >> dmesg and go ahead, otherwise iWarp needs to address this difference >> with a proper generic API, which appears, must wrapper >> ib_post_send. Harder job :\ >> >> I'm sorry Steve for leading you down the wrong path with these flags, >> I did not fully realize exactly what the iWarp difference was at the >> start :( > > Are there security issues? Yes. Are we going to solve them in this > patch set? No. Especially since those security issues extend beyond > iSER + iWARP. > > There are currently 4 RDMA capable link types/protocols. Of those 4, 3 > enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot > be sent out into the wild internet). RFC5042 is written with iWARP in > mind because it *can* be sent out over the Internet and therefore it is > possible for someone to place an RNIC card directly on the Internet and > it must therefore deal with all of the attack vectors this entails. > > The linux RDMA stack, as a whole, is no where near RFC5042 compliant. > And it isn't just the kernel space at fault here. Early versions of > opensm used to require that the master and slave machines must have > passwordless root ssh access to each other to copy around the guid2lid > file so when the master failed over, the slave would have an up to date > list. > > Because 3 of the 4 RDMA technology types enforce localized clusters, the > majority of the stack and the implementations have utilized the cluster > as the point of trust. In truth, I think it's simply been the easy way > to do things. RFC5042 is very clearly a non-trust model where each > machine, and each connection, is assumed to be a threat. > > So, long story short(ish): > > While this discussion is taking place in the cleanup thread, part of it > involves the preceding patch set: iSER support for iWARP. Patch 4/5 in > the iSER/iWARP series enables the remote write on device->mr > registration which constitutes the security issue we've been discussing. > This cleanup patch set, then, is merely guilty of hiding that security > issue, not introducing it. > > I think Steve and the other iSER folks should consider the wisdom of > including the iWARP support as it stands. If they think that the common > use case is in a sequestered cluster, and that the current trust domain > is already not RFC5042 compliant, then I'm OK with enabling iSER support > for iWARP, but I would want there to be a very clear warning in the > dmesg about the trust domain issue. I might even suggest that the iSER > driver, at least on iWARP connections, could possibly check the src/dst > addresses to make sure that they are in a private IP domain and if they > aren't, refuse to create a connection while emitting an additional > warning about the trust domain. > > As far as the cleanup patches are concerned, I'm OK with those too. > However, since we would have added a large warning in the iSER code > about this issue, and that would then go away with the cleanup, the > warning would need to be generalized (such as "module <foo> is > requesting insecure iWARP mappings"), and appropriately WARN_ONCEd on > probably a per module basis or something like that. > > From there, we can start to look at the bigger picture of cleanup up the > default trust domain in the kernel and user space both (and soliciting > feedback on that...I have a suspicion that some users will not like us > tightening up security as it might interfere with their usage in their > sequestered clusters). > > -- > Doug Ledford <dledford@redhat.com> > GPG KeyID: 0E572FDD > > -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/10/2015 1:56 PM, Doug Ledford wrote: > On 07/10/2015 12:11 PM, Jason Gunthorpe wrote: >> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: >>> and it is enabled only when the RDMA Read is active. >> >> ??? How is that done? ib_get_dma_mr is defined to return a remote >> usable rkey that is valid from when it it returns until the MR is >> destroyed. NFS creates one mr early on, it does not seem to make one >> per RDMA READ? Actually you're right - the NFS server never tears down the MR. So, the old code is just as vulnerable as the new code. >> For the proposed iSER patch the problem is very acute, iser makes a >> single PD and phys MR at boot time for each device. This means there >> is a single machine wide unchanging rkey that allows remote physical >> memory access. An attacker only has to repeatedly open QPs to iSER and >> guess rkey values until they find it. Add in likely non-crypto >> randomness for rkeys, and I bet it isn't even that hard to do. The rkeys have a PD, wich cannot be forged, so it's not a matter of attacking, but it is most definitely a system integrity risk, as I mentioned earlier, a simple arithmetic offset mistake can overwrite anything. >> So this seems to be a catastrophic security failing. s/security/integrity/ and I agree. > As I recall, didn't the NFSoRDMA stack do even worse in the beginning > (meaning it previously had a memory model that was simply "register all > memory on the client and pass the rkey to the server and then tell the > server when/where to read/write", which was subsequently removed in > Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)? Yes, Doug, shamefully it did, but a) That was the client and Jason and I are talking about the server b) When that code was written (2007), FRMR did not exist, and other memory registration methods had abysmal performance c) It was heavily deprecated with console warnings emitted d) Chuck did away with it since then. > I'm not bringing this up to downplay the current iSER issue, but to > demonstrate that we have long had a different trust model than the one > in RFC5042. More below on that. I'll say something after requoting you. >>>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER >>>> to be acceptable security wise, I'd rather see that work done on in a >>>> general way. Hence the API discussion. >>> >>> Well, it's important to realize that NFS already does the Right Thing. >>> So it isn't actually an urgent issue. There is time to discuss. >> >> It depends on what is going on with iWarp iSER. If the iWarp community >> thinks it should go ahead insecurely then fine, put a giant warning in >> dmesg and go ahead, otherwise iWarp needs to address this difference >> with a proper generic API, which appears, must wrapper >> ib_post_send. Harder job :\ >> >> I'm sorry Steve for leading you down the wrong path with these flags, >> I did not fully realize exactly what the iWarp difference was at the >> start :( > > Are there security issues? Yes. Are we going to solve them in this > patch set? No. Especially since those security issues extend beyond > iSER + iWARP. > > There are currently 4 RDMA capable link types/protocols. Of those 4, 3 > enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot > be sent out into the wild internet). RFC5042 is written with iWARP in > mind because it *can* be sent out over the Internet and therefore it is > possible for someone to place an RNIC card directly on the Internet and > it must therefore deal with all of the attack vectors this entails. > > The linux RDMA stack, as a whole, is no where near RFC5042 compliant. > And it isn't just the kernel space at fault here. Early versions of > opensm used to require that the master and slave machines must have > passwordless root ssh access to each other to copy around the guid2lid > file so when the master failed over, the slave would have an up to date > list. > > Because 3 of the 4 RDMA technology types enforce localized clusters, the > majority of the stack and the implementations have utilized the cluster > as the point of trust. In truth, I think it's simply been the easy way > to do things. RFC5042 is very clearly a non-trust model where each > machine, and each connection, is assumed to be a threat. > > So, long story short(ish): > > While this discussion is taking place in the cleanup thread, part of it > involves the preceding patch set: iSER support for iWARP. Patch 4/5 in > the iSER/iWARP series enables the remote write on device->mr > registration which constitutes the security issue we've been discussing. > This cleanup patch set, then, is merely guilty of hiding that security > issue, not introducing it. > > I think Steve and the other iSER folks should consider the wisdom of > including the iWARP support as it stands. If they think that the common > use case is in a sequestered cluster, and that the current trust domain > is already not RFC5042 compliant, then I'm OK with enabling iSER support > for iWARP, but I would want there to be a very clear warning in the > dmesg about the trust domain issue. I might even suggest that the iSER > driver, at least on iWARP connections, could possibly check the src/dst > addresses to make sure that they are in a private IP domain and if they > aren't, refuse to create a connection while emitting an additional > warning about the trust domain. > > As far as the cleanup patches are concerned, I'm OK with those too. > However, since we would have added a large warning in the iSER code > about this issue, and that would then go away with the cleanup, the > warning would need to be generalized (such as "module <foo> is > requesting insecure iWARP mappings"), and appropriately WARN_ONCEd on > probably a per module basis or something like that. > > From there, we can start to look at the bigger picture of cleanup up the > default trust domain in the kernel and user space both (and soliciting > feedback on that...I have a suspicion that some users will not like us > tightening up security as it might interfere with their usage in their > sequestered clusters). All excellent points. It's not worse, and it adds important transport support. However, it's an extremely bad idea to codify writable privileged rmr's in the API as best practice. So under no circumstance should that become the long term plan. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote:
> There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.
It's in the staging tree, which proper in-tree code doesn't have to
cater for. So as soon as sunrpc is done using the interface we can and
should kill it off.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 10, 2015 at 01:56:36PM -0400, Doug Ledford wrote: > Are there security issues? Yes. Are we going to solve them in this > patch set? No. Especially since those security issues extend beyond > iSER + iWARP. I think my biggest concern is we don't inadvertently open a security hole on a machine that just happens to have an iwarp card installed, but has nothing to do with HPC. The clearest scary path I see is a server listening on a QP and using IB_ACCESS_REMOTE_WRITE. That sure looks easy to exploit. A client doing this.. It is alot harder to exploit.. iSER is client only, so less worrying. Can anyone else think of a way to attack the client? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/10/2015 04:57 PM, Jason Gunthorpe wrote: > On Fri, Jul 10, 2015 at 01:56:36PM -0400, Doug Ledford wrote: > >> Are there security issues? Yes. Are we going to solve them in this >> patch set? No. Especially since those security issues extend beyond >> iSER + iWARP. > > I think my biggest concern is we don't inadvertently open a security > hole on a machine that just happens to have an iwarp card installed, > but has nothing to do with HPC. Given the number of Chelsio cards utilized in things like TCP Offload and iSCSI offload situations versus HPC, this isn't an unreasonable thing to consider. However, the attack vector is limited to a situation where all of the below are true: 1) An RDMA connection exists or can be created (TOE and iSCSI offload do not do this, so something else would have to be listening and accepting incoming RDMA connections) 2) A global rkey exists (so it's not sufficient that any old RDMA app be running, at least one app/ULP *must* create the global writable rkey) 3) The QP of the RDMA connection belongs to the same PD as the global rkey (so our attack surface is limited to the bad server app/ULP and the existence of this does not put other apps/ULPs at risk) 4) For maximum damage, the global rkey must also belong to an app/ULP with elevated privilege or direct memory access (kernel ULPs are prime targets, as a user app would have a mapped address space and the global rkey in its domain would apply to it's process space, limiting the damage that can be done). Given all these requirements, I don't see this as a possibility. In order to be at risk, there *must* be a listening app/ULP, and we should be able to print out a nice, dire warning in dmesg if that app/ULP opens itself up in this way. > The clearest scary path I see is a server listening on a QP and using > IB_ACCESS_REMOTE_WRITE. That sure looks easy to exploit. This goes back to the trust domain. For a server, you simply can't allow untrusted clients. Period. But that's easy enough to do with access controls and connection filtering. The app/ULP itself doesn't even need to be filter aware as you can do the filtering in the TCP stack on the primary listening socket using the netfilter tools. And that goes back to the kernel warning I referred to in my previous email. Let users know what module is making this risky decision and it becomes easy to filter any ports that module's services are listening on. > A client doing this.. It is alot harder to exploit.. iSER is client > only, so less worrying. Can anyone else think of a way to attack the > client? TCP injection only is all I've got.
On 07/10/2015 02:42 PM, Tom Talpey wrote: > However, it's an extremely bad idea to codify writable privileged rmr's > in the API as best practice. So under no circumstance should that become > the long term plan. Agree 100%. Which is why I requested a big warning in the dmesg output that pointed a big, accusatory finger at the offending module with the long term goal of reforming the memory registration models to a method that eliminates this problem in the kernel.
On 07/10/2015 07:34 PM, Jason Gunthorpe wrote: > On Fri, Jul 10, 2015 at 06:27:59PM -0400, Doug Ledford wrote: > >> 1) An RDMA connection exists or can be created (TOE and iSCSI offload do >> not do this, so something else would have to be listening and accepting >> incoming RDMA connections) >> 2) A global rkey exists (so it's not sufficient that any old RDMA app be >> running, at least one app/ULP *must* create the global writable rkey) >> 3) The QP of the RDMA connection belongs to the same PD as the global >> rkey (so our attack surface is limited to the bad server app/ULP and the >> existence of this does not put other apps/ULPs at risk) >> 4) For maximum damage, the global rkey must also belong to an app/ULP >> with elevated privilege or direct memory access (kernel ULPs are prime >> targets, as a user app would have a mapped address space and the global >> rkey in its domain would apply to it's process space, limiting the >> damage that can be done). > > This list looks right to me. Good, we're in agreement so far ;-) >> Given all these requirements, I don't see this as a possibility. In > > Hmm. So, if I put an iWarp card in a machine, boot any distro, and add > something to /etc/exports .. > > Does the NFS RDMA kernel module load and start a listening QP? That depends on the OS. > If not, that actually sounds like a bug. What if a distro fixes that? Red Hat requires that you enable NFSoRDMA. But, your point is valid. That goes back to my point about the update patch set being very verbose about the dangers of this. The very obvious kernel message is helpful in this case. >>> The clearest scary path I see is a server listening on a QP and using >>> IB_ACCESS_REMOTE_WRITE. That sure looks easy to exploit. >> >> This goes back to the trust domain. For a server, you simply can't >> allow untrusted clients. Period. > > This feels like an antiquated security model. Most things these days > are using in-band mutual auth and then switch to a trusted mode. Both > NFS and iSCSI support that (kerberos,chap,etc), I assume that generic > support extends to RDMA. > > If you use auth then trust as a security model, this rkey buisness is > a problem, because you should not be vunerable to attack before > authing. Yes. This is an antiquated security model. There is a reason for that. It is easier to go simple to begin with and then add the extra layers needed to keep things safe once you have the basics right. >> access controls and connection filtering. The app/ULP itself doesn't >> even need to be filter aware as you can do the filtering in the TCP >> stack on the primary listening socket using the netfilter tools. > > Does netfilter work for iWarp? I'm surprised to hear that. iWARP requires a normal TCP socket to connect to, then the client must initiate an RDMA transfer, then a new connection is opened for the RDMA transfers. Blocking the parent dst:port/*:* will prevent these connections. If you are referring to allowing an untrusted client in TCP mode but blocking them in RDMA mode, that's more complex and requires app/ULP support. >> that goes back to the kernel warning I referred to in my previous email. >> Let users know what module is making this risky decision and it becomes >> easy to filter any ports that module's services are listening on. > > That's fine, as long as the user is opting it this situation. If I'm > not using RDMA but have an iWarp card, I should never see the > message, even if RDMA support is autoloaded by my distro... See above, an unexpected vulnerability caused by this model should result in a very obvious kernel message. >>> A client doing this.. It is alot harder to exploit.. iSER is client >>> only, so less worrying. Can anyone else think of a way to attack the >>> client? >> >> TCP injection only is all I've got. > > Black hat server can also do it, and since it is possible before auth > it is doable without the server auth credential. Black hat server is beyond the scope of this discussion.
On Thu, Jul 09, 2015 at 11:01:42AM -0600, Jason Gunthorpe wrote: > To your point in another message, I'd say, as long as the new API > supports FRMR at full speed with no performance penalty we are > good. If the other variants out there take a performance hit, then I > think that is OK. As you say, they are on the way out, we just need to > make sure that ULPs continue to work with FMR with the new API so > legacy cards don't completely break. I think what we need to support for now is FRMR as the primary target, and FMR as a secondard. A few in-kernel drivers support physical registrations, but it's always been clear that they've been a bad idea, and the current discussion reconfirms that again. So a FRMR-like API that allows to use FMRs underneath might be a good start. If the imput to that API are S/G lists as in my suggestion we'll get indirect registration support almost for free. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 10, 2015 at 11:55:29AM +0300, Sagi Grimberg wrote: > If there is one thing worse than a complicated API, it is a restrictive > one. I'd much rather ULPs just having a simple API for registering > memory. Quite to the contrary. The complex API almost asks for weird abuses and twists. The too restrictive one means people that want to extend it need to start a discussion on how we extent the API, which improves the chances to come up with something sensible dramatically. > >I expect all these calls would be function pointers, and each driver > >would provide a function pointer that is optimal for it's use. Eg mlx4 > >would provide a pointer that used the S/G list, then falls back to > >FRMR if the S/G list is exhausted. The core code would provide a > >toolbox of common functions the drivers can use here. > > Maybe it's just me, but I can't help but wander if this is facilitating > an atmosphere where drivers will keep finding new ways to abuse even > the most simple operations. I'm not too excited about moving the code in the drivers. The RDMA subsystem actually has a lot more hardware drivers than ULDs, so moving logic into them seems like a major step backwards. From my journeys into the drivers it rather seems like they are doing too much work already. Having a few schemes availabe in the core code that the driver can chose from seems like a much more sensible option. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> On Jul 10, 2015, at 9:11 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > >> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: >> and it is enabled only when the RDMA Read is active. > > ??? How is that done? ib_get_dma_mr is defined to return a remote > usable rkey that is valid from when it it returns until the MR is > destroyed. NFS creates one mr early on, it does not seem to make one > per RDMA READ? > > For the proposed iSER patch the problem is very acute, iser makes a > single PD and phys MR at boot time for each device. This means there > is a single machine wide unchanging rkey that allows remote physical > memory access. An attacker only has to repeatedly open QPs to iSER and > guess rkey values until they find it. Add in likely non-crypto > randomness for rkeys, and I bet it isn't even that hard to do. > > So this seems to be a catastrophic security failing. > >>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER >>> to be acceptable security wise, I'd rather see that work done on in a >>> general way. Hence the API discussion. >> >> Well, it's important to realize that NFS already does the Right Thing. >> So it isn't actually an urgent issue. There is time to discuss. > > It depends on what is going on with iWarp iSER. If the iWarp community > thinks it should go ahead insecurely then fine, put a giant warning in > dmesg and go ahead, otherwise iWarp needs to address this difference > with a proper generic API, which appears, must wrapper > ib_post_send. Harder job :\ > > I'm sorry Steve for leading you down the wrong path with these flags, > I did not fully realize exactly what the iWarp difference was at the > start :( > > Jason No problem. I'll work on iSER target FRMRs and repost the iSER series. Sagi, right now isert only uses FRMRs along with signature mrs. I'll need to separate the two, I think. Does that sound right?-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/10/2015 10:34 PM, Christoph Hellwig wrote: > On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote: >> There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre. > > It's in the staging tree, which proper in-tree code doesn't have to > cater for. So as soon as sunrpc is done using the interface we can and > should kill it off. > I think we should probably ask the Lustre folks if they have a real use case for it before we remove it completely. It seems that Lustre strives to FMRs and if it is not supported it uses a PHYS_MR. Given FMRs has an expiration date and PHYS_MR is basically alloc/free of MRs in the data path (which is obviously not desirable in any form). Lustre will need to refresh their code to use the standard FRWR. Do we have Lustre folks listening on the mailing list? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/11/2015 7:37 PM, Steve Wise wrote: > >> On Jul 10, 2015, at 9:11 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: >> >>> On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: >>> and it is enabled only when the RDMA Read is active. >> >> ??? How is that done? ib_get_dma_mr is defined to return a remote >> usable rkey that is valid from when it it returns until the MR is >> destroyed. NFS creates one mr early on, it does not seem to make one >> per RDMA READ? >> >> For the proposed iSER patch the problem is very acute, iser makes a >> single PD and phys MR at boot time for each device. This means there >> is a single machine wide unchanging rkey that allows remote physical >> memory access. An attacker only has to repeatedly open QPs to iSER and >> guess rkey values until they find it. Add in likely non-crypto >> randomness for rkeys, and I bet it isn't even that hard to do. >> >> So this seems to be a catastrophic security failing. >> >>>> From there, it is a logical wish: If Steve is going to FRMR'ize iSER >>>> to be acceptable security wise, I'd rather see that work done on in a >>>> general way. Hence the API discussion. >>> >>> Well, it's important to realize that NFS already does the Right Thing. >>> So it isn't actually an urgent issue. There is time to discuss. >> >> It depends on what is going on with iWarp iSER. If the iWarp community >> thinks it should go ahead insecurely then fine, put a giant warning in >> dmesg and go ahead, otherwise iWarp needs to address this difference >> with a proper generic API, which appears, must wrapper >> ib_post_send. Harder job :\ >> >> I'm sorry Steve for leading you down the wrong path with these flags, >> I did not fully realize exactly what the iWarp difference was at the >> start :( >> >> Jason > > > No problem. I'll work on iSER target FRMRs and repost the iSER series. > > Sagi, right now isert only uses FRMRs along with signature mrs. I'll need to separate the two, I think. Does that sound right? Yea. Given that FRWR takes extra HW (and memory) resources, it should probably be: if (signature support || iwarp) use FRMR -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jul 11, 2015 at 03:25:38AM -0700, 'Christoph Hellwig' wrote: > So a FRMR-like API that allows to use FMRs underneath might be a good > start. If the imput to that API are S/G lists as in my suggestion we'll > get indirect registration support almost for free. That would be an excellent outcome in my view. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jul 11, 2015 at 03:31:53AM -0700, 'Christoph Hellwig' wrote: > I'm not too excited about moving the code in the drivers. The RDMA > subsystem actually has a lot more hardware drivers than ULDs, so moving > logic into them seems like a major step backwards. From my journeys > into the drivers it rather seems like they are doing too much work > already. Yes, that is true.. > Having a few schemes availabe in the core code that the driver can chose > from seems like a much more sensible option. I think that makes sense, but several of the schemes we are working with are effectively single-vendor schemes. Indirect MR and DIX are good examples of things that only one vendor implements, and are not standardized. In those cases it is hard to argue the core should provide support for them. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Jul 12, 2015 at 10:49:08AM +0300, Sagi Grimberg wrote: > On 7/10/2015 10:34 PM, Christoph Hellwig wrote: > >On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote: > >>There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre. > > > >It's in the staging tree, which proper in-tree code doesn't have to > >cater for. So as soon as sunrpc is done using the interface we can and > >should kill it off. > > > > I think we should probably ask the Lustre folks if they have a real use > case for it before we remove it completely. I had an interesting conversation with some Lustre devs where they were very concerned that FMR was going to be removed and they didn't really seem to even know that FRMR existed. I'm sure their PHYS_MR usage is crufty old code to support old adaptors. And again, here is a great example of how the API is not helping the ULPs do what they want to do. Lustre doesn't care one bit about this stuff, they just want to send messages.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 10, 2015 at 11:10:23PM -0400, Doug Ledford wrote: > >> access controls and connection filtering. The app/ULP itself doesn't > >> even need to be filter aware as you can do the filtering in the TCP > >> stack on the primary listening socket using the netfilter tools. > > > > Does netfilter work for iWarp? I'm surprised to hear that. > > iWARP requires a normal TCP socket to connect to, then the client must > initiate an RDMA transfer, then a new connection is opened for the RDMA > transfers. Blocking the parent dst:port/*:* will prevent these > connections. If you are referring to allowing an untrusted client in > TCP mode but blocking them in RDMA mode, that's more complex and > requires app/ULP support. Yes, that would be the use case here. If someone wishes to deploy auth-then-trust in TCP mode with NFS/iSCSI (which is a kernel supported TCP/IP mode) we need to be absolutely certain there is no way for anything untrusted to pivot a connection into a RDMA mode and exploit the RKEY problem. Out of the box this must be impossible. The surest way to guarentee that is to have this hack off by default. > Black hat server is beyond the scope of this discussion. We cannot assume an all-trusted model here, there are many configurations to deploy NFS/iSCSI that don't assume that. Even if you assume it for the RDMA cases (which I stronlgy disagree with), it still must be proven to not weaken the existing TCP/IP case. So, a black hat server is on the table, attacking a client that the admin is not intending to use with RDMA, by forcing it to switch to RDMA before auth and exploiting the RDMA side. This is where the iwarp guys have to analyze and come back to say it is OK. Maybe iwarp can't get to rdma without auth or something... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: > I think what we need to support for now is FRMR as the primary target, > and FMR as a secondar[y]. FMR is a *very* bad choice, for several reasons. 1) It's not supported by very many devices, in fact it might even be thought to be obsolete. 2) It does not protect on byte boundaries, therefore it is unsafe to use for anything but page-sized, page-aligned RDMA operations. 3) It is a synchronous API, i.e. it is not work-request based and therefore is not very high performance. 4) It sometimes is used with a "pool", which defers deregistration in the hopes of amortizing overhead. However, this deferral further increases the risk of remote access, including altering of memory contents after the fact. Personally, I'd recommend ib_get_phys_mr() over FMR. It at least doesn't suffer from issues 1, 2 and 4. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: > On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: > >I think what we need to support for now is FRMR as the primary target, > >and FMR as a secondar[y]. > > FMR is a *very* bad choice, for several reasons. If an API can transparently support FMR, then I think it can also transparently support ib_get_phys_mr as an alternative, they look pretty similar... ? > Personally, I'd recommend ib_get_phys_mr() over FMR. It at least > doesn't suffer from issues 1, 2 and 4. Your comments are right for the rkey case, but for lkey, there is no security concern with using a FMR, or pooling them. It doesn't look like any iwarp drivers supports FMR, so they are certainly safe to use on IB as the lkey. This is why I am becoming more convinced that treating lkey and rkey the same is not helpful. Conversely, it looks like if we could drop ehca and mthca we could ditch FMR entirely.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/13/2015 1:18 PM, Jason Gunthorpe wrote: > On Fri, Jul 10, 2015 at 11:10:23PM -0400, Doug Ledford wrote: >> Black hat server is beyond the scope of this discussion. > > We cannot assume an all-trusted model here, there are many > configurations to deploy NFS/iSCSI that don't assume that. Even if you > assume it for the RDMA cases (which I stronlgy disagree with), it > still must be proven to not weaken the existing TCP/IP case. > > So, a black hat server is on the table, attacking a client that the > admin is not intending to use with RDMA, by forcing it to switch to > RDMA before auth and exploiting the RDMA side. > > This is where the iwarp guys have to analyze and come back to say it > is OK. Maybe iwarp can't get to rdma without auth or something... Two observations. One, auth is an Upper Layer matter. It's not the job of the transport to authenticate the peer, the user, etc. Upper layers do this, and iSCSI performs a login, NFSv4.1+ creates a session, SMB3 creates sessions on multiple connections, etc. All this happens after the connection is established. Two, the iSCSI and NFSv4.1 protocols have explicit support for iWARP "step-up" mode, which supports an initial connection in TCP (i.e. with RDMA disabled), and switching to RDMA mode dynamically. The IB and RoCE protocols do not support step-up mode, so in fact one could argue that iWARP is *better* positioned to support this. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: > On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: > >I think what we need to support for now is FRMR as the primary target, > >and FMR as a secondar[y]. > > FMR is a *very* bad choice, for several reasons. > > 1) It's not supported by very many devices, in fact it might even > be thought to be obsolete. It's support by the Mellanox adapters, which might be a minority of the drivers, but at least in the field I've been in the aboslute majority of the deployed hardware. It's the default in the SRP initiator and iSER initiators, and the primary fallback in the NFS client if FRs aren't available. I don't claim to be an expert on memory registrations models, as they are horribly, horrible documents (baiscally not at all in the source tree), so my knowledge is from looking at and using implementations as well as this useful writeup from the NFS folks: http://wiki.linux-nfs.org/wiki/index.php/NfsRdmaClient/MemRegModes Which misses or downplays an important restriction of PHYS registrations: They dont allow coalescing non-contiguous memory into a single maping, which makes them totally unsuitable for iSER which only allows a single rkey/offset/len pair and requires additional RDMA READ/WRITE roundtrips and larger S/G lists for other protocols. Based on looking at the consumers and the above table I think FMR are still the better fallback for devices that don't support FR, but supporting PHYS MRs is easy enough that adding it to a common layer seems easy enough if some cares enough to regularly test it. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/13/2015 7:50 PM, Jason Gunthorpe wrote: > On Sun, Jul 12, 2015 at 10:49:08AM +0300, Sagi Grimberg wrote: >> On 7/10/2015 10:34 PM, Christoph Hellwig wrote: >>> On Thu, Jul 09, 2015 at 09:52:59AM -0400, Chuck Lever wrote: >>>> There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre. >>> >>> It's in the staging tree, which proper in-tree code doesn't have to >>> cater for. So as soon as sunrpc is done using the interface we can and >>> should kill it off. >>> >> >> I think we should probably ask the Lustre folks if they have a real use >> case for it before we remove it completely. > > I had an interesting conversation with some Lustre devs where they > were very concerned that FMR was going to be removed and they didn't really > seem to even know that FRMR existed. > > I'm sure their PHYS_MR usage is crufty old code to support old > adaptors. > > And again, here is a great example of how the API is not helping the > ULPs do what they want to do. Lustre doesn't care one bit about this > stuff, they just want to send messages.. All protocols cares about transferring data and sending messages, so it's not a good enough reason for a poor registration method choice. This just emphasizes why we need to converge to a single method. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> Having a few schemes availabe in the core code that the driver can chose >> from seems like a much more sensible option. > > I think that makes sense, but several of the schemes we are working > with are effectively single-vendor schemes. Indirect MR and DIX are > good examples of things that only one vendor implements, and are not > standardized. > > In those cases it is hard to argue the core should provide support for > them. I'm sorry but I disagree with the statement above. T10-DIF/DIX is a standard and *not* a vendor specific feature. Other storage protocols (FC, SAS, NVME) fully support it. So yes, the core *needs* to support it. Moreover, the core stack poses memory alignment limitations for registration. The RDMA stack should allow an interface that can resolve these limitations if the device is capable of handling them. I'd say this is something all vendors should look at. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/13/2015 11:15 PM, Jason Gunthorpe wrote: > On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: >> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: >>> I think what we need to support for now is FRMR as the primary target, >>> and FMR as a secondar[y]. >> >> FMR is a *very* bad choice, for several reasons. > > If an API can transparently support FMR, then I think it can also > transparently support ib_get_phys_mr as an alternative, they look > pretty similar... ? Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP PoV. If you expose an API that might schedule (PHYS_MR) it limits the context that the caller is allowed to call in. I'm 100% against an registration API that *might* schedule. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote: > On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: >> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: >>> I think what we need to support for now is FRMR as the primary target, >>> and FMR as a secondar[y]. >> >> FMR is a *very* bad choice, for several reasons. >> >> 1) It's not supported by very many devices, in fact it might even >> be thought to be obsolete. > > It's support by the Mellanox adapters, *Older* Mellanox adapters. mlx5 (and future drivers I assume) will no longer support FMRs. > which might be a minority of > the drivers, but at least in the field I've been in the aboslute > majority of the deployed hardware. > > It's the default in the SRP initiator and iSER initiators, and the > primary fallback in the NFS client if FRs aren't available. We should gradually move away from that. > > I don't claim to be an expert on memory registrations models, as they > are horribly, horrible documents (baiscally not at all in the source > tree), so my knowledge is from looking at and using implementations as > well as this useful writeup from the NFS folks: > > http://wiki.linux-nfs.org/wiki/index.php/NfsRdmaClient/MemRegModes > > Which misses or downplays an important restriction of PHYS > registrations: They dont allow coalescing non-contiguous memory > into a single maping, which makes them totally unsuitable for iSER > which only allows a single rkey/offset/len pair and requires > additional RDMA READ/WRITE roundtrips and larger S/G lists for other > protocols. I'm willing to add a proper memory_registration.txt under Documentation/infiniband/ which would capture the items that were brought up here. > > Based on looking at the consumers and the above table I think FMR > are still the better fallback for devices that don't support FR, It's better if you want it fast. I can't stress it enough, but IMO, the fallback should *not* be in the API, but rather in the ULP. Ideally, at some point it won't need to fall back, and we can remove the API. > but supporting PHYS MRs is easy enough that adding it to a common > layer seems easy enough if some cares enough to regularly test it. > Someone needs a good reason to use it. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 5:22 AM, Sagi Grimberg wrote: > On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote: >> On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: >>> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: >>>> I think what we need to support for now is FRMR as the primary target, >>>> and FMR as a secondar[y]. >>> >>> FMR is a *very* bad choice, for several reasons. >>> >>> 1) It's not supported by very many devices, in fact it might even >>> be thought to be obsolete. >> >> It's support by the Mellanox adapters, > > *Older* Mellanox adapters. mlx5 (and future drivers I assume) will > no longer support FMRs. Right, but drop the word "will". Mlx5 (the current ConnectX-4) doesn't support FMR. It's only supported by mlx4 and mthca drivers. >> Based on looking at the consumers and the above table I think FMR >> are still the better fallback for devices that don't support FR, > > It's better if you want it fast. Do you guys think FMR is actually "fast"? Because it's not. Measure it. It might have been marginally faster than other schemes in like, 2007, and only when the mempool was there to leave the registrations open. Don't go back there. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 4:06 AM, Sagi Grimberg wrote: > All protocols cares about transferring data and sending messages, so > it's not a good enough reason for a poor registration method choice. > This just emphasizes why we need to converge to a single method. In my opinion, we already have it. For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not quite equivalent, btw. For remote registrations, ib_post_send(FRMR). Unfortunately, there exist ancient adapters in-tree that don't support FRMR, and some ULPs have attempted to work on them. That's why the situation is confusing. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 3:24 PM, Tom Talpey wrote: > On 7/14/2015 4:06 AM, Sagi Grimberg wrote: >> All protocols cares about transferring data and sending messages, so >> it's not a good enough reason for a poor registration method choice. >> This just emphasizes why we need to converge to a single method. > > In my opinion, we already have it. > > For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not > quite equivalent, btw. > > For remote registrations, ib_post_send(FRMR). > > Unfortunately, there exist ancient adapters in-tree that don't support > FRMR, and some ULPs have attempted to work on them. That's why the > situation is confusing. > Exactly. This is why I'd prefer not to make an effort to have a unified API that maintains any form of (confusing) fallback policy. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 3:12 PM, Tom Talpey wrote: > On 7/14/2015 5:22 AM, Sagi Grimberg wrote: >> On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote: >>> On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: >>>> On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: >>>>> I think what we need to support for now is FRMR as the primary target, >>>>> and FMR as a secondar[y]. >>>> >>>> FMR is a *very* bad choice, for several reasons. >>>> >>>> 1) It's not supported by very many devices, in fact it might even >>>> be thought to be obsolete. >>> >>> It's support by the Mellanox adapters, >> >> *Older* Mellanox adapters. mlx5 (and future drivers I assume) will >> no longer support FMRs. > > Right, but drop the word "will". Mlx5 (the current ConnectX-4) doesn't > support FMR. It's only supported by mlx4 and mthca drivers. > >>> Based on looking at the consumers and the above table I think FMR >>> are still the better fallback for devices that don't support FR, >> >> It's better if you want it fast. > > Do you guys think FMR is actually "fast"? Because it's not. Measure it. > It might have been marginally faster than other schemes in like, 2007, > and only when the mempool was there to leave the registrations open. > Don't go back there. I agree, I meant the pool API which has obvious problems... Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > I'm willing to add a proper memory_registration.txt under > Documentation/infiniband/ which would capture the items that > were brought up here. > This would be great. Documentation/infiniband is very sparse... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 12:10:36PM +0300, Sagi Grimberg wrote: > Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP > PoV. If you expose an API that might schedule (PHYS_MR) it limits the > context that the caller is allowed to call in. > > I'm 100% against an registration API that *might* schedule. Oh, I had missed that PHYS_MR might sleep. That might be the reasons why everyone is avoiding them despite Tom preferring them over FMR. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 12:22:14PM +0300, Sagi Grimberg wrote: > It's better if you want it fast. I can't stress it enough, but IMO, the > fallback should *not* be in the API, but rather in the ULP. > Ideally, at some point it won't need to fall back, and we can remove > the API. But if all driver need to impement the same dispatch between FR/FMR/ALLPHYS and having similar fast path hacks something is wrong. Maybe we should start but just implementing the helper you suggest in separae FR/FMR/ALLPHYS handlers while phasing out hacks like use ALLPHYS for single segment registrations or if other registrations fail. And as a second step we can decide if their are similar enough to consolidate them, or if we should simply kick out FMR and/or ALLPHYS and thus not support any in-kernel consumers for old HCAs. n the meantime let's kick off the now aready unuses memwindow in-kernel API (I fear the userlevel one will have to stay forever) and the about to be unused by proper in-kernel code PHYS MRs. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 11:36 AM, 'Christoph Hellwig' wrote: > On Tue, Jul 14, 2015 at 12:10:36PM +0300, Sagi Grimberg wrote: >> Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP >> PoV. If you expose an API that might schedule (PHYS_MR) it limits the >> context that the caller is allowed to call in. >> >> I'm 100% against an registration API that *might* schedule. > > Oh, I had missed that PHYS_MR might sleep. That might be the reasons > why everyone is avoiding them despite Tom preferring them over FMR. Any synchronous verb might sleep. They issue commands to the adapter across the PCIe bus, and need to wait for the result. FMR does, too. At least the work-request-based ones are explicit about waiting for the completion. Are you saying you want an API that does memory registration without blocking AND without sending a work request to hardware, ever? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 08:36:19AM -0700, 'Christoph Hellwig' wrote: > Oh, I had missed that PHYS_MR might sleep. That might be the reasons > why everyone is avoiding them despite Tom preferring them over FMR. Yep, almost certainly. But even that is just a legacy of the bad API. Even Sagi's API idea can accommodate this with enough driver effort, but only if posting is combined: - alloc_mr: Just set aside any memory the driver needs for PHYS_MR - set_sg_list_and_post: + This would issue the PHYS_MR call to the NIC, but not sleep. It would adjust the SQ so that the tail pointer is blocked and the NIC doesn't see any more posts. Effectively, the SQ stalls. This way the ULP can post more stuff and ordering is preserved. + It would then setup a callback for PHYS_MR NIC completion and return, having never slept. + The callback will unblock the SQ Yes, this is complicated, but it shows how combing MR setup and post together lets us do a lot more. Sadly, we could probably never do this for older drivers due to them being unmaintained, but it is certainly possible the core could provide an older driver wrapper that emulates the above with less efficiency using threads/queues/etc. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 11:24:26AM +0300, Sagi Grimberg wrote: > > > > >>Having a few schemes availabe in the core code that the driver can chose > >>from seems like a much more sensible option. > > > >I think that makes sense, but several of the schemes we are working > >with are effectively single-vendor schemes. Indirect MR and DIX are > >good examples of things that only one vendor implements, and are not > >standardized. > > > >In those cases it is hard to argue the core should provide support for > >them. > > I'm sorry but I disagree with the statement above. T10-DIF/DIX is a > standard and *not* a vendor specific feature. Other storage protocols > (FC, SAS, NVME) fully support it. So yes, the core *needs* to support it. Sorry, I ment the exact details of how DIX is exposed through verbs is only implemented by one vendor (eg IB_WR_REG_SIG_MR). The core absolutely should provide an API for this, I'm not convinced it should be at the MR level though.... > Moreover, the core stack poses memory alignment limitations for > registration. The RDMA stack should allow an interface that can resolve > these limitations if the device is capable of handling them. I'd say > this is something all vendors should look at. Sure would be nice.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> > >> I'm sorry Steve for leading you down the wrong path with these flags, > >> I did not fully realize exactly what the iWarp difference was at the > >> start :( > >> > >> Jason > > > > > > No problem. I'll work on iSER target FRMRs and repost the iSER series. > > > > Sagi, right now isert only uses FRMRs along with signature mrs. I'll need to separate the two, I think. Does that sound right? > > Yea. > > Given that FRWR takes extra HW (and memory) resources, it > should probably be: > > if (signature support || iwarp) > use FRMR Currently the code does: if (device_supports_fastreg && device_supports_signature) use FRMR else use DMAMR Shouldn't we just recode it this way? if (device_supports_fastreg) use FRMR else use DMAMR The benefit is that we don't have to check for iWARP protocol in the ULP. The side effect is, I think, mlx4 will now use FRMR instead of DMAMR for reads/writes since it doesn't support signature handover. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote: > The benefit is that we don't have to check for iWARP protocol in the > ULP. IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm pretty sure you have to check for iWarp at somepoint.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote: > > > The benefit is that we don't have to check for iWARP protocol in the > > ULP. > > IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm > pretty sure you have to check for iWarp at somepoint.. > You mean "should not", yea? Ok. I'll check for iWARP. But don't tell me to remove the transport-specific hacks in this series when I post it! ;) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 02:32:31PM -0500, Steve Wise wrote:
> Ok. I'll check for iWARP. But don't tell me to remove the transport-specific hacks in this series when I post it! ;)
Sure thing! Wonder what the test should be? cap_remote_access_lkey? No idea.
Still think this should all be hidden in a wrapper for posting RDMA READ :|
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote: > if (device_supports_fastreg && device_supports_signature) > use FRMR > else > use DMAMR > > Shouldn't we just recode it this way? > > if (device_supports_fastreg) > use FRMR > else > use DMAMR How does IB_DEVICE_LOCAL_DMA_LKEY (and ->local_dma_lkey) play into this? Seems like that should be the preferred option if supported. Interestingly enough various iWarp driver seem to support this option, what's the story behind that? The (to me surprising) conclusion on the list was that iWarp would always need a memory regireations that also allows remove writes even for lkeys, but from looking at the users of IB_DEVICE_LOCAL_DMA_LKEY / local_dma_lkey that seem to prefer that over creating a MR. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 02:32:31PM -0500, Steve Wise wrote: > You mean "should not", yea? > > Ok. I'll check for iWARP. But don't tell me to remove the transport-specific hacks in this series when I post it! ;) Just curious if there are any holes in this little scheme to deal with the lkey mess: (1) make sure all drivers that currently do not set IB_DEVICE_LOCAL_DMA_LKEY but which can safely use ib_get_dma_mr call it underneath at device setup time, and tear it down before removal. (2) now ULD can check for IB_DEVICE_LOCAL_DMA_LKEY and use local_dma_lkey in that case, or will have to perform a per-IO MR with LOCAL and REMOTE flags if not (3) remove insecure remote uses of ib_get_dma_mr from ULDs (4) remove ib_get_dma_mr from the public API This should help to sort out the lkey side of the memory registrations, and isn't too hard. For rkeys I'd love to go with something like Sagis proposal as a first step, and then we can see if we can refine it in another iteration. Given that we might not be able to do the above for the next merge window add your iWarp transport heck for now, at least we'll have a clear plan to remove it. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 12:45:12PM -0700, 'Christoph Hellwig' wrote: > How does IB_DEVICE_LOCAL_DMA_LKEY (and ->local_dma_lkey) play into > this? Seems like that should be the preferred option if supported. Very good question. > Interestingly enough various iWarp driver seem to support this option, > what's the story behind that? The (to me surprising) conclusion on > the list was that iWarp would always need a memory regireations that Only for the RDMA READ lkey. All the other verbs are the same as IB, and should use a local access all physical MR lkey. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: 'Christoph Hellwig' [mailto:hch@infradead.org] > Sent: Tuesday, July 14, 2015 2:45 PM > To: Steve Wise > Cc: 'Sagi Grimberg'; 'Steve Wise'; 'Jason Gunthorpe'; 'Tom Talpey'; 'Doug Ledford'; 'Christoph Hellwig'; sagig@mellanox.com; > ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux- > nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer' > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote: > > if (device_supports_fastreg && device_supports_signature) > > use FRMR > > else > > use DMAMR > > > > Shouldn't we just recode it this way? > > > > if (device_supports_fastreg) > > use FRMR > > else > > use DMAMR > > How does IB_DEVICE_LOCAL_DMA_LKEY (and ->local_dma_lkey) play into > this? Seems like that should be the preferred option if supported. > The local_dma_lkey can be used in any rdma sge that requires an lkey. It is supported for kernel-mode only. So if you're only ever going to use the lkey for IO, you really don't need a DMA_MR at all, unless you want to somehow partition up your work load by protection domain. But I claim for lkeys, the PD doesn't really protect anything since the remote peers can't use it anyway. > Interestingly enough various iWarp driver seem to support this option, > what's the story behind that? The (to me surprising) conclusion on > the list was that iWarp would always need a memory regireations that > also allows remove writes even for lkeys, but from looking at the > users of IB_DEVICE_LOCAL_DMA_LKEY / local_dma_lkey that seem to > prefer that over creating a MR. There is confusion about lkeys and rkeys with regard to iWARP. In the iWARP verbs, there is no distinction between an lkey and rkey: they are the same key, called a Steering Tag or STAG. When you create a MR, the lkey == rkey == STAG for iwarp transports. Somewhat related, but really a different issue, is that SGEs that are the target of a read need REMOTE_WRITE access flags on their STAG for iWARP. Clear as mud? :) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: 'Christoph Hellwig' [mailto:hch@infradead.org] > Sent: Tuesday, July 14, 2015 2:55 PM > To: Steve Wise > Cc: 'Jason Gunthorpe'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; 'Christoph Hellwig'; sagig@mellanox.com; > ogerlitz@mellanox.com; roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux- > nfs@vger.kernel.org; trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer' > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On Tue, Jul 14, 2015 at 02:32:31PM -0500, Steve Wise wrote: > > You mean "should not", yea? > > > > Ok. I'll check for iWARP. But don't tell me to remove the transport-specific hacks in this series when I post it! ;) > > Just curious if there are any holes in this little scheme to deal with > the lkey mess: > > (1) make sure all drivers that currently do not set > IB_DEVICE_LOCAL_DMA_LKEY but which can safely use ib_get_dma_mr > call it underneath at device setup time, and tear it down before > removal. > (2) now ULD can check for IB_DEVICE_LOCAL_DMA_LKEY and use local_dma_lkey > in that case, or will have to perform a per-IO MR with LOCAL and > REMOTE flags if not > (3) remove insecure remote uses of ib_get_dma_mr from ULDs > (4) remove ib_get_dma_mr from the public API > Perhaps I missed some of the discussion on all this, but what is the point of #1? Are these 4 steps intended to be (bisectable) steps / commits with the goal of removing ib_get_dma_mr()? If so I still don't get #1. Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 02:58:23PM -0500, Steve Wise wrote: > The local_dma_lkey can be used in any rdma sge that requires an > lkey. No, this is where iWarp doesn't follow the generic API - a local dma lkey cannot be used with iWarp's RDMA_READ WR lkey. In effect RDMA READ requires an *rkey* (confusingly stuck into the lkey slot) for iWarp. (Right?) *THAT* is really the core difference between IB and iWarp in this area, not that the access flags are different. (cap_rmda_read_lkey_is_rkey ?) > domain. But I claim for lkeys, the PD doesn't really protect > anything since the remote peers can't use it anyway. I agree. To use a PD properly I'd have thought it should be created on a client by client basis. The risk is tiny, but client X should not be able to guess Y's RKey and then corrupt a data transfer. *Especially* on a server if client X hasn't auth'd yet .... That is what the PD is for. > There is confusion about lkeys and rkeys with regard to iWARP. In > the iWARP verbs, there is no distinction between an lkey and rkey: > they are the same key, called a Steering Tag or STAG. When you > create a MR, the lkey == rkey == STAG for iwarp transports. > Somewhat related, but really a different issue, is that SGEs that > are the target of a read need REMOTE_WRITE access flags on their > STAG for iWARP. That is the clearest explanation for the iWarp difference I've seen so far, thanks! Christoph: The take away from all this is that on iWarp RDMA_READ requires a restricted temporary MR to provide the lkey value in the WC. It cannot use local_dma_lkey. Everything else is the same between IB and iWarp: local_dma_lkey can be used as the lkey for SEND, RECV, WRITE. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/14/2015 3:32 PM, Steve Wise wrote: >> >> On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote: >> >>> The benefit is that we don't have to check for iWARP protocol in the >>> ULP. >> >> IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm >> pretty sure you have to check for iWarp at somepoint.. >> > > You mean "should not", yea? > > Ok. I'll check for iWARP. But don't tell me to remove the transport-specific hacks in this series when I post it! ;) > FYI, in the Windows NDKPI (verbs-like kernel provider interface), there is a device attribute defined as follows: <https://msdn.microsoft.com/en-us/library/windows/hardware/hh439851(v=vs.85).aspx> NDK_ADAPTER_FLAG_RDMA_READ_SINK_NOT_REQUIRED 0x00000002 Set if the provider does not require special access rights on the sink buffer for an RDMA read request. When this flag is set, the consumer is not required to use the NDK_MR_FLAG_RDMA_READ_SINK or NDK_OP_FLAG_RDMA_READ_SINK flags when it registers sink buffers for RDMA read requests. The consumer can also use logical address mappings directly (with a token obtained with the NDK_FN_GET_PRIVILEGED_MEMORY_REGION_TOKEN function) as RDMA read sink buffers. This is similar to access to local buffers for RDMA write, send, and receive operations. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Jason Gunthorpe > Sent: Tuesday, July 14, 2015 3:42 PM > To: Steve Wise > Cc: 'Christoph Hellwig'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com; > roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; > trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer' > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > On Tue, Jul 14, 2015 at 02:58:23PM -0500, Steve Wise wrote: > > The local_dma_lkey can be used in any rdma sge that requires an > > lkey. > > No, this is where iWarp doesn't follow the generic API - a local dma > lkey cannot be used with iWarp's RDMA_READ WR lkey. In effect RDMA > READ requires an *rkey* (confusingly stuck into the lkey slot) for > iWarp. (Right?) > Right, a local_dma_lkey is not an rkey, and iwarp requires the rkey for the read destination MR. Further that rkey needs REMOTE_WRITE. > *THAT* is really the core difference between IB and iWarp in this > area, not that the access flags are different. > It both. Because an rkey without REMOTE_WRITE would not work. > (cap_rmda_read_lkey_is_rkey ?) > IMO its more like rdma_cap_read_dest_needs_remote_write(). And maybe make it camel step too. ;) > > domain. But I claim for lkeys, the PD doesn't really protect > > anything since the remote peers can't use it anyway. > > I agree. > > To use a PD properly I'd have thought it should be created on a client > by client basis. The risk is tiny, but client X should not be able to > guess Y's RKey and then corrupt a data transfer. *Especially* on a > server if client X hasn't auth'd yet .... That is what the PD is for. > > > There is confusion about lkeys and rkeys with regard to iWARP. In > > the iWARP verbs, there is no distinction between an lkey and rkey: > > they are the same key, called a Steering Tag or STAG. When you > > create a MR, the lkey == rkey == STAG for iwarp transports. > > Somewhat related, but really a different issue, is that SGEs that > > are the target of a read need REMOTE_WRITE access flags on their > > STAG for iWARP. > > That is the clearest explanation for the iWarp difference I've seen so > far, thanks! > > Christoph: The take away from all this is that on iWarp RDMA_READ > requires a restricted temporary MR to provide the lkey value in the > WC. It cannot use local_dma_lkey. > > Everything else is the same between IB and iWarp: local_dma_lkey can > be used as the lkey for SEND, RECV, WRITE. > The source of a WRITE can use local_dma_lkey, not the destination. But this is true for IB and IW... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Steve Wise > Sent: Tuesday, July 14, 2015 3:51 PM > To: 'Jason Gunthorpe' > Cc: 'Christoph Hellwig'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com; > roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; > trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer' > Subject: RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > > > > -----Original Message----- > > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Jason Gunthorpe > > Sent: Tuesday, July 14, 2015 3:42 PM > > To: Steve Wise > > Cc: 'Christoph Hellwig'; 'Sagi Grimberg'; 'Steve Wise'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com; > > roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; > > trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer' > > Subject: Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > > > On Tue, Jul 14, 2015 at 02:58:23PM -0500, Steve Wise wrote: > > > The local_dma_lkey can be used in any rdma sge that requires an > > > lkey. > > > > No, this is where iWarp doesn't follow the generic API - a local dma > > lkey cannot be used with iWarp's RDMA_READ WR lkey. In effect RDMA > > READ requires an *rkey* (confusingly stuck into the lkey slot) for > > iWarp. (Right?) > > > > Right, a local_dma_lkey is not an rkey, and iwarp requires the rkey for the read destination MR. Further that rkey needs > REMOTE_WRITE. > BTW: What use is an IB rkey with no REMOTE_ flags set? Can it be used somehow differently than the associated lkey? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 14, 2015 at 04:01:04PM -0500, Steve Wise wrote: > > Right, a local_dma_lkey is not an rkey, and iwarp requires the > > rkey for the read destination MR. Further that rkey needs > > REMOTE_WRITE. > > BTW: What use is an IB rkey with no REMOTE_ flags set? Can it be > used somehow differently than the associated lkey? Don't think so. Sagi? This looks like it is just an artifact of the sloppy API that treats a rkey MR and lkey MR as the same thing - they are clearly not, and we should start talking about APIs that return lkeys or rkeys, never both (and enforcing that lkey and rkey MRS have the right ACCESS flags). Having looked at this for a bit now, I am of the view that it is very hard to use a MR as both rkey and lkey without creating some kind of security problem. At least every place in current ULPs that does this is a security problem :) So the API should prevent it, IMHO. local_dma_lkey is an excellent step, and if Sagi's MR unification patch is careful to have a lkey/rkey API entry point for the two usages we can maybe nuke this problem for good... Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > This just emphasizes why we need to converge to a single method. > > In my opinion, we already have it. > > For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not > quite equivalent, btw. Personally, I would work to eliminate local registration as part of the API. > For remote registrations, ib_post_send(FRMR). I slightly agree here. IMO, the rkey should be obtained by associating the memory buffer with the QP, but not explicitly as a 'send' operation. If queuing is a concern, we could expose max_active_registrations/max_total_registrations QP attributes. I would hide the protection domain concept entirely. If needed, a provider can allocate one PD per QP, with memory registration going to the PD. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> There is confusion about lkeys and rkeys with regard to iWARP. In the > iWARP verbs, there is no distinction between an lkey and > rkey: they are the same key, called a Steering Tag or STAG. When you > create a MR, the lkey == rkey == STAG for iwarp transports. > Somewhat related, but really a different issue, is that SGEs that are the > target of a read need REMOTE_WRITE access flags on their > STAG for iWARP. > > Clear as mud? :) This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters. They convey different ways of finding a data buffer. For example, do you locate a buffer using the stag, then verify that the offset + length fits into the target buffer? Or do you locate the buffer by address, then verify that the key matches? Consider if we allow an app to specify the rkey/stag, or reference the buffer using an offset, rather than a virtual address. This seems to be part of the difference between an lkey and an rkey. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Hefty, Sean [mailto:sean.hefty@intel.com] > Sent: Thursday, July 23, 2015 1:53 PM > To: Steve Wise; 'Christoph Hellwig' > Cc: 'Sagi Grimberg'; 'Steve Wise'; 'Jason Gunthorpe'; 'Tom Talpey'; 'Doug Ledford'; sagig@mellanox.com; ogerlitz@mellanox.com; > roid@mellanox.com; linux-rdma@vger.kernel.org; eli@mellanox.com; target-devel@vger.kernel.org; linux-nfs@vger.kernel.org; > trond.myklebust@primarydata.com; bfields@fieldses.org; 'Oren Duer' > Subject: RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags > > > There is confusion about lkeys and rkeys with regard to iWARP. In the > > iWARP verbs, there is no distinction between an lkey and > > rkey: they are the same key, called a Steering Tag or STAG. When you > > create a MR, the lkey == rkey == STAG for iwarp transports. > > Somewhat related, but really a different issue, is that SGEs that are the > > target of a read need REMOTE_WRITE access flags on their > > STAG for iWARP. > > > > Clear as mud? :) > > This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters. They convey different ways of finding a data buffer. For > example, do you locate a buffer using the stag, then verify that the offset + length fits into the target buffer? Yes. HW always uses the stag to locate a record that contains the stag state (valid or invalid), the access flags, the 8b key, the va_base, length, PBL describing the host pages, etc. HW validates all that before using the buffer. NOTE: An stag of 0 is the special local-dma-lkey which HW treats differently: If the stag is 0, then the address in the SGE is the bus/dma address itself and no lookup of a MR/PBL/etc is needed. Stag 0 can ONLY be used by kernel users and MUST never be accepted/used from an ingress packet and MUST never be emitted on the wire in a READ or WRITE. > Or do you locate the buffer > by address, then verify that the key matches? > This is never done. > Consider if we allow an app to specify the rkey/stag, or reference the buffer using an offset, rather than a virtual address. > > This seems to be part of the difference between an lkey and an rkey. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' > matters. They convey different ways of finding a data > buffer. For > > example, do you locate a buffer using the stag, then verify that the > offset + length fits into the target buffer? > > Yes. HW always uses the stag to locate a record that contains the stag > state (valid or invalid), the access flags, the 8b key, the > va_base, length, PBL describing the host pages, etc. HW validates all > that before using the buffer. NOTE: An stag of 0 is the > special local-dma-lkey which HW treats differently: If the stag is 0, then > the address in the SGE is the bus/dma address itself and > no lookup of a MR/PBL/etc is needed. Stag 0 can ONLY be used by kernel > users and MUST never be accepted/used from an ingress > packet and MUST never be emitted on the wire in a READ or WRITE. > > > Or do you locate the buffer > > by address, then verify that the key matches? > > > > This is never done. These were more rhetorical questions to highlight that stag is a better choice for the name than rkey. Lkey seems better than stag, though I'd rather see lkey removed completely. I don't see where usnic, ipath, qib, or opa technically need an lkey at all. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 23, 2015 at 11:30:08PM +0000, Hefty, Sean wrote: > I don't see where usnic, ipath, qib, or opa technically need an lkey > at all. The lkey is possibly useful if someone wants to do single op transfers larger than the S/G limit of the SQE. I haven't noticed any ULPs doing that.. qib family may not technically need a lkey, but those drivers are the only modern drivers not to support IB_DEVICE_LOCAL_DMA_LKEY. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> The lkey is possibly useful if someone wants to do single op transfers > larger than the S/G limit of the SQE. I haven't noticed any ULPs doing > that.. That changes how the buffer is identified, which gets back to my question of are we identifying local buffers by address or through some sort of iova/tag/descriptor/mr/whatever. Do we have this right in the API? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 24, 2015 at 12:18:41AM +0000, Hefty, Sean wrote: > > The lkey is possibly useful if someone wants to do single op transfers > > larger than the S/G limit of the SQE. I haven't noticed any ULPs doing > > that.. > > That changes how the buffer is identified, which gets back to my > question of are we identifying local buffers by address or through > some sort of iova/tag/descriptor/mr/whatever. Do we have this right > in the API? After Sagi's work, and my patchset, all ULPs will talk about local buffers in only two ways: - struct ib_sge w/ local_dma_lkey - struct scatterlist with memory registration to a simple ib_sge. (only done for iWarp RDMA READ? Maybe iSER DIX as well?) Efforts to unify them have not been successful for fairly reasonable reasons :) Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index bac3fb4..f42595c 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -1144,6 +1144,36 @@ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) } EXPORT_SYMBOL(ib_get_dma_mr); +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs) +{ + int access_flags = attrs; + + if (roles & RDMA_MRR_RECV) + access_flags |= IB_ACCESS_LOCAL_WRITE; + + if (roles & RDMA_MRR_WRITE_DEST) + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE; + + if (roles & RDMA_MRR_READ_DEST) { + access_flags |= IB_ACCESS_LOCAL_WRITE; + if (rdma_protocol_iwarp(pd->device, + rdma_start_port(pd->device))) + access_flags |= IB_ACCESS_REMOTE_WRITE; + } + + if (roles & RDMA_MRR_READ_SOURCE) + access_flags |= IB_ACCESS_REMOTE_READ; + + if (roles & RDMA_MRR_ATOMIC_DEST) + access_flags |= IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_ATOMIC; + + if (roles & RDMA_MRR_MW_BIND) + access_flags |= IB_ACCESS_MW_BIND; + + return access_flags; +} +EXPORT_SYMBOL(rdma_device_access_flags); + struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, struct ib_phys_buf *phys_buf_array, int num_phys_buf, diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 986fddb..da1eadf 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -2494,6 +2494,112 @@ static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); /** + * rdma_mr_roles - possible roles an RDMA MR will be used for + * + * This allows a transport independent RDMA application to + * create MRs that are usable for all the desired roles w/o + * having to understand which access rights are needed. + */ +enum { + + /* lkey used in a ib_recv_wr sge */ + RDMA_MRR_RECV = 1, + + /* lkey used for IB_WR_SEND* ops in the ib_send_wr sge */ + RDMA_MRR_SEND = (1<<1), + + /* rkey used for IB_WR_RDMA_READ* ops in ib_send_wr wr.rdma.rkey */ + RDMA_MRR_READ_SOURCE = (1<<2), + + /* lkey used for IB_WR_RDMA_READ* ops in the ib_send_wr sge */ + RDMA_MRR_READ_DEST = (1<<3), + + /* lkey used for IB_WR_RDMA_WRITE* ops in the ib_send_wr sge */ + RDMA_MRR_WRITE_SOURCE = (1<<4), + + /* rkey used for IB_WR_RDMA_WRITE* ops in ib_send_wr wr.rdma.rkey */ + RDMA_MRR_WRITE_DEST = (1<<5), + + /* + * lkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in the ib_send_wr sge + */ + RDMA_MRR_ATOMIC_SOURCE = (1<<6), + + /* + * rkey used for IB_WR_ATOMIC/MASKED_ATOMIC* ops in ib_send_wr + * wr.atomic.rkey + */ + RDMA_MRR_ATOMIC_DEST = (1<<7), + + /* MR used for a IB_WR_MW_BIND in ib_send_wr wr.bind_mw.bind_info.mr */ + RDMA_MRR_MW_BIND = (1<<8), +}; + +/** + * rdma_mr_attributes - attributes for rdma memory regions + */ +enum { + RDMA_MRA_ZERO_BASED = IB_ZERO_BASED, + RDMA_MRA_ACCESS_ON_DEMAND = IB_ACCESS_ON_DEMAND, +}; + +/** + * rdma_device_access_flags - Returns the device-specific MR access flags. + * @pd: The protection domain associated with the memory region. + * @roles: The intended roles of the MR + * @attrs: The desired attributes of the MR + * + * Use the intended roles from @roles along with @attrs and the device + * capabilities to generate the needed access rights. + * + * Return: the ib_access_flags value needed to allocate the MR. + */ +int rdma_device_access_flags(struct ib_pd *pd, int roles, int attrs); + +/** + * rdma_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @roles: The intended roles of the MR + * @attrs: The desired attributes of the MR + * + * Use the intended roles from @roles along with @attrs and the device + * capabilities to define the needed access rights, and call + * ib_get_dma_mr() to allocate the MR. + * + * Note that the ib_dma_*() functions defined below must be used + * to create/destroy addresses used with the Lkey or Rkey returned + * by ib_get_dma_mr(). + * + * Return: struct ib_mr pointer upon success, or a pointer encoded errno upon + * failure. + */ +static inline struct ib_mr *rdma_get_dma_mr(struct ib_pd *pd, int roles, + int attrs) +{ + return ib_get_dma_mr(pd, rdma_device_access_flags(pd, roles, attrs)); +} + +/** + * rdma_fast_reg_access_flags - Returns the access flags needed for a fast + * register operation. + * @pd: The protection domain associated with the memory region. + * @roles: The intended roles of the MR + * @attrs: The desired attributes of the MR + * + * Use the intended roles from @roles along with @attrs and the device + * capabilities to define the needed access rights for a fast registration + * work request. + * + * Return: the ib_access_flags value needed for fast registration. + */ +static inline int rdma_fast_reg_access_flags(struct ib_pd *pd, int roles, + int attrs) +{ + return rdma_device_access_flags(pd, roles, attrs); +} + +/** * ib_dma_mapping_error - check a DMA addr for error * @dev: The device for which the dma_addr was created * @dma_addr: The DMA address to check
The semantics for MR access flags are not consistent across RDMA protocols. So rather than have applications try and glean what they need, have them pass in the intended roles and attributes for the MR to be allocated and let the RDMA core select the appropriate access flags given the roles, attributes, and device capabilities. We introduce rdma_mr_roles and rdma_mr_attributes that enumerate the possible roles and attributes for a MR. These are documented in the enums themselves. New services exported: rdma_device_access_flags() - given the intended MR roles and attributes passed in, return the ib_access_flags bits for the device. rdma_get_dma_mr() - allocate a dma mr using the applications intended MR roles and MR attributes. This uses rdma_device_access_flags(). rdma_fast_reg_access_flags() - return the ib_access_flags bits needed for a fast register WR given the applications intended MR roles and MR attributes. This uses rdma_device_access_flags(). Signed-off-by: Steve Wise <swise@opengridcomputing.com> --- drivers/infiniband/core/verbs.c | 30 +++++++++++ include/rdma/ib_verbs.h | 106 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 136 insertions(+), 0 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html