Message ID | 20190620150337.7847-1-jinpuwang@gmail.com (mailing list archive) |
---|---|
Series | InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) |
Hello Doug, Hello Jason, Hello Jens, Hello Greg, Could you please provide some feedback on the IBNBD driver and the IBTRS library? So far we have addressed all the requests raised by the community and continue to keep our code up to date with the upstream kernel, while maintaining an extra compatibility layer for older kernels in our out-of-tree repository. I understand that SRP and NVMEoF, which are already in the kernel, provide equivalent functionality for the majority of the use cases. IBNBD, on the other hand, is showing higher performance and, more importantly, includes IBTRS, a general-purpose library to establish connections and transport BIO-like read/write sg-lists over RDMA, while SRP targets SCSI and NVMEoF addresses NVMe. While I believe IBNBD does meet the kernel coding standards, it doesn't have a lot of users, whereas SRP and NVMEoF are widely accepted. Do you think it would make sense for us to rework our patchset and try pushing it to the staging tree first, so that we can prove IBNBD is well maintained and beneficial for the ecosystem, and find a proper location for it within the block/rdma subsystems? This would make it easier for people to try it out and would also be a huge step for us in terms of maintenance effort. The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the near future). Do you think it would make sense to rename the driver to RNBD/RTRS? Thank you, Best Regards, Danil On Thu, Jun 20, 2019 at 5:03 PM Jack Wang <jinpuwang@gmail.com> wrote: > > Hi all, > > Here is v4 of the IBNBD/IBTRS patches, which contains minor changes > > Changelog > --------- > v4: > o Protocol extended to transport IO priorities > o Support for Mellanox ConnectX-4/X-5 > o Minor sysfs extensions (display access mode on server side) > o Bug fixes: cleaning up sysfs folders, race on deallocation of resources > o Style fixes > > v3: > o Sparse fixes: > - le32 -> le16 conversion > - pcpu and RCU wrong declaration > - sysfs: dynamically alloc array of sockaddr structures to reduce > size of a stack frame > > o Rename sysfs folder on client and server sides to show source and > destination addresses of the connection, i.e.: > .../<session-name>/paths/<src@dst>/ > > o Remove external inclusions from Makefiles. > * https://lwn.net/Articles/756994/ > > v2: > o IBNBD: > - No legacy request IO mode, only MQ is left. > > o IBTRS: > - No FMR registration, only FR is left. > > * https://lwn.net/Articles/755075/ > > v1: > - IBTRS: load-balancing and IO fail-over using multipath features were added. > > - Major parts of the code were rewritten, simplified and overall code > size was reduced by a quarter. > > * https://lwn.net/Articles/746342/ > > v0: > - Initial submission > > * https://lwn.net/Articles/718181/ > > > Introduction > ------------- > > IBTRS (InfiniBand Transport) is a reliable high-speed transport library > which allows for establishing connections between client and server > machines via RDMA. It is based on RDMA-CM, so it is expected to also support RoCE > and iWARP, but we have mainly tested it in an IB environment. It is optimized to > transfer (read/write) IO blocks in the sense that it follows the BIO > semantics of providing the possibility to either write data from a > scatter-gather list to the remote side or to request ("read") data > transfer from the remote side into a given set of buffers. > > IBTRS is multipath capable and provides I/O fail-over and load-balancing > functionality, i.e.
in IBTRS terminology, an IBTRS path is a set of RDMA > CMs, and a particular path is selected according to the load-balancing policy. > It can be used by other components and is not bound to IBNBD. > > > IBNBD (InfiniBand Network Block Device) is a pair of kernel modules > (client and server) that allow for remote access to a block device on > the server over the IBTRS protocol. After being mapped, the remote block > devices can be accessed on the client side as local block devices. > Internally IBNBD uses IBTRS as an RDMA transport library. > > > - IBNBD/IBTRS was developed in order to map thin-provisioned volumes, > thus the internal protocol is simple. > - IBTRS was developed as an independent RDMA transport library, which > supports fail-over and load-balancing policies using multipath, thus > it can be used for any other IO needs rather than only for block > devices. > - IBNBD/IBTRS is fast. > Old comparison results: > https://www.spinics.net/lists/linux-rdma/msg48799.html > New comparison results: see performance measurements section below. > > Key features of the IBTRS transport library and the IBNBD block device: > > o High throughput and low latency due to: > - Only two RDMA messages per IO. > - IMM InfiniBand messages on responses to reduce round-trip latency. > - Simplified memory management: memory allocation happens once on > the server side when the IBTRS session is established. > > o IO fail-over and load-balancing by using multipath. According to > our test loads, an additional path brings ~20% more bandwidth. > > o Simple configuration of IBNBD: > - Server side is completely passive: volumes do not need to be > explicitly exported. > - Only the IB port GID and the device path are needed on the client side to map > a block device. > - A device is remapped automatically, e.g. after a storage reboot.
> > Commits for kernel can be found here: > https://github.com/ionos-enterprise/ibnbd/tree/linux-5.2-rc3--ibnbd-v4 > The out-of-tree modules are here: > https://github.com/ionos-enterprise/ibnbd > > Vault 2017 presentation: > https://events.static.linuxfound.org/sites/events/files/slides/IBNBD-Vault-2017.pdf > > Performance measurements > ------------------------ > > o IBNBD and NVMEoRDMA > > Performance results for the v5.2-rc3 kernel > link: https://github.com/ionos-enterprise/ibnbd/tree/develop/performance/v4-v5.2-rc3 > > Roman Pen (25): > sysfs: export sysfs_remove_file_self() > ibtrs: public interface header to establish RDMA connections > ibtrs: private headers with IBTRS protocol structs and helpers > ibtrs: core: lib functions shared between client and server modules > ibtrs: client: private header with client structs and functions > ibtrs: client: main functionality > ibtrs: client: statistics functions > ibtrs: client: sysfs interface functions > ibtrs: server: private header with server structs and functions > ibtrs: server: main functionality > ibtrs: server: statistics functions > ibtrs: server: sysfs interface functions > ibtrs: include client and server modules into kernel compilation > ibtrs: a bit of documentation > ibnbd: private headers with IBNBD protocol structs and helpers > ibnbd: client: private header with client structs and functions > ibnbd: client: main functionality > ibnbd: client: sysfs interface functions > ibnbd: server: private header with server structs and functions > ibnbd: server: main functionality > ibnbd: server: functionality for IO submission to file or block dev > ibnbd: server: sysfs interface functions > ibnbd: include client and server modules into kernel compilation > ibnbd: a bit of documentation > MAINTAINERS: Add maintainer for IBNBD/IBTRS modules > > MAINTAINERS | 14 + > drivers/block/Kconfig | 2 + > drivers/block/Makefile | 1 + > drivers/block/ibnbd/Kconfig | 24 + > drivers/block/ibnbd/Makefile | 13 + > drivers/block/ibnbd/README | 315 ++ > drivers/block/ibnbd/ibnbd-clt-sysfs.c | 691 ++++ > drivers/block/ibnbd/ibnbd-clt.c | 1832 +++++++++++ > drivers/block/ibnbd/ibnbd-clt.h | 166 + > drivers/block/ibnbd/ibnbd-log.h | 59 + > drivers/block/ibnbd/ibnbd-proto.h | 378 +++ > drivers/block/ibnbd/ibnbd-srv-dev.c | 408 +++ > drivers/block/ibnbd/ibnbd-srv-dev.h | 143 + > drivers/block/ibnbd/ibnbd-srv-sysfs.c | 270 ++ > drivers/block/ibnbd/ibnbd-srv.c | 945 ++++++ > drivers/block/ibnbd/ibnbd-srv.h | 94 + > drivers/infiniband/Kconfig | 1 + > drivers/infiniband/ulp/Makefile | 1 + > drivers/infiniband/ulp/ibtrs/Kconfig | 22 + > drivers/infiniband/ulp/ibtrs/Makefile | 15 + > drivers/infiniband/ulp/ibtrs/README | 385 +++ > .../infiniband/ulp/ibtrs/ibtrs-clt-stats.c | 447 +++ > .../infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c | 514 +++ > drivers/infiniband/ulp/ibtrs/ibtrs-clt.c | 2844 +++++++++++++++++ > drivers/infiniband/ulp/ibtrs/ibtrs-clt.h | 308 ++ > drivers/infiniband/ulp/ibtrs/ibtrs-log.h | 84 + > drivers/infiniband/ulp/ibtrs/ibtrs-pri.h | 463 +++ > .../infiniband/ulp/ibtrs/ibtrs-srv-stats.c | 103 + > .../infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c | 303 ++ > drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1998 ++++++++++++ > drivers/infiniband/ulp/ibtrs/ibtrs-srv.h | 170 + > drivers/infiniband/ulp/ibtrs/ibtrs.c | 610 ++++ > drivers/infiniband/ulp/ibtrs/ibtrs.h | 318 ++ > fs/sysfs/file.c | 1 + > 34 files changed, 13942 insertions(+) > create mode 100644 drivers/block/ibnbd/Kconfig > create mode 100644 drivers/block/ibnbd/Makefile > create mode 100644 
drivers/block/ibnbd/README > create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c > create mode 100644 drivers/block/ibnbd/ibnbd-clt.c > create mode 100644 drivers/block/ibnbd/ibnbd-clt.h > create mode 100644 drivers/block/ibnbd/ibnbd-log.h > create mode 100644 drivers/block/ibnbd/ibnbd-proto.h > create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c > create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h > create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c > create mode 100644 drivers/block/ibnbd/ibnbd-srv.c > create mode 100644 drivers/block/ibnbd/ibnbd-srv.h > create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig > create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile > create mode 100644 drivers/infiniband/ulp/ibtrs/README > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c > create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h > > -- > 2.17.1 >
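The cover letter's "only two RDMA messages per IO" boils down to the client posting an RDMA WRITE that carries immediate data and the server answering with a message whose immediate data carries the status. As a rough illustration (this is not code from the patches; the helper name and the surrounding bookkeeping are hypothetical), posting such a WRITE-with-immediate through the kernel verbs API looks approximately like this:

#include <rdma/ib_verbs.h>

/*
 * Illustrative sketch only, not taken from the ibtrs patches: post one
 * RDMA WRITE carrying immediate data into a pre-registered remote buffer.
 * The verbs structures and constants are the real kernel API; everything
 * else is a stand-in.
 */
static int example_post_write_imm(struct ib_qp *qp, struct ib_sge *sge,
                                  unsigned int num_sge, u64 remote_addr,
                                  u32 rkey, u32 imm)
{
        struct ib_rdma_wr wr = {};
        const struct ib_send_wr *bad_wr;

        wr.wr.opcode      = IB_WR_RDMA_WRITE_WITH_IMM;
        wr.wr.send_flags  = IB_SEND_SIGNALED;
        wr.wr.sg_list     = sge;
        wr.wr.num_sge     = num_sge;
        /* the immediate value is delivered to the peer with its completion */
        wr.wr.ex.imm_data = cpu_to_be32(imm);
        wr.remote_addr    = remote_addr;
        wr.rkey           = rkey;

        return ib_post_send(qp, &wr.wr, &bad_wr);
}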
On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > Could you please provide some feedback to the IBNBD driver and the > IBTRS library? > So far we addressed all the requests provided by the community and > continue to maintain our code up-to-date with the upstream kernel > while having an extra compatibility layer for older kernels in our > out-of-tree repository. > I understand that SRP and NVMEoF which are in the kernel already do > provide equivalent functionality for the majority of the use cases. > IBNBD on the other hand is showing higher performance and more > importantly includes the IBTRS - a general purpose library to > establish connections and transport BIO-like read/write sg-lists over > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > I believe IBNBD does meet the kernel coding standards, it doesn't have > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > it would make sense for us to rework our patchset and try pushing it > for staging tree first, so that we can proof IBNBD is well maintained, > beneficial for the eco-system, find a proper location for it within > block/rdma subsystems? This would make it easier for people to try it > out and would also be a huge step for us in terms of maintenance > effort. > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > near future). Do you think it would make sense to rename the driver to > RNBD/RTRS? It is better to avoid the "staging" tree, because it will lack the attention of the relevant people and your efforts will be lost once you try to move out of staging. We all remember Lustre and don't want to see that again. Back then, you were asked to provide support for the claimed performance superiority. Can you please share any numbers with us? Thanks
On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote: > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > Could you please provide some feedback to the IBNBD driver and the > > IBTRS library? > > So far we addressed all the requests provided by the community and > > continue to maintain our code up-to-date with the upstream kernel > > while having an extra compatibility layer for older kernels in our > > out-of-tree repository. > > I understand that SRP and NVMEoF which are in the kernel already do > > provide equivalent functionality for the majority of the use cases. > > IBNBD on the other hand is showing higher performance and more > > importantly includes the IBTRS - a general purpose library to > > establish connections and transport BIO-like read/write sg-lists over > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > it would make sense for us to rework our patchset and try pushing it > > for staging tree first, so that we can proof IBNBD is well maintained, > > beneficial for the eco-system, find a proper location for it within > > block/rdma subsystems? This would make it easier for people to try it > > out and would also be a huge step for us in terms of maintenance > > effort. > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > near future). Do you think it would make sense to rename the driver to > > RNBD/RTRS? > > It is better to avoid "staging" tree, because it will lack attention of > relevant people and your efforts will be lost once you will try to move > out of staging. We are all remembering Lustre and don't want to see it > again. That's up to the developers, that had nothing to do with the fact that the code was in the staging tree. If the Lustre developers had actually done the requested work, it would have moved out of the staging tree. So if these developers are willing to do the work to get something out of staging, and into the "real" part of the kernel, I will gladly take it. But I will note that it is almost always easier to just do the work ahead of time, and merge it in "correctly" than to go from staging into the real part of the kernel. But it's up to the developers what they want to do. thanks, greg k-h
Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道: > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > Could you please provide some feedback to the IBNBD driver and the > > IBTRS library? > > So far we addressed all the requests provided by the community and > > continue to maintain our code up-to-date with the upstream kernel > > while having an extra compatibility layer for older kernels in our > > out-of-tree repository. > > I understand that SRP and NVMEoF which are in the kernel already do > > provide equivalent functionality for the majority of the use cases. > > IBNBD on the other hand is showing higher performance and more > > importantly includes the IBTRS - a general purpose library to > > establish connections and transport BIO-like read/write sg-lists over > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > it would make sense for us to rework our patchset and try pushing it > > for staging tree first, so that we can proof IBNBD is well maintained, > > beneficial for the eco-system, find a proper location for it within > > block/rdma subsystems? This would make it easier for people to try it > > out and would also be a huge step for us in terms of maintenance > > effort. > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > near future). Do you think it would make sense to rename the driver to > > RNBD/RTRS? > > It is better to avoid "staging" tree, because it will lack attention of > relevant people and your efforts will be lost once you will try to move > out of staging. We are all remembering Lustre and don't want to see it > again. > > Back then, you was asked to provide support for performance superiority. > Can you please share any numbers with us? Hi Leon, Thanks for you feedback. For performance numbers, Danil did intensive benchmark, and create some PDF with graphes here: https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3 It includes both single path results also different multipath policy results. If you have any question regarding the results, please let us know. > > Thanks Thanks Jack Wang
Greg KH <gregkh@linuxfoundation.org> 于2019年7月9日周二 下午1:17写道: > > On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote: > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > > > Could you please provide some feedback to the IBNBD driver and the > > > IBTRS library? > > > So far we addressed all the requests provided by the community and > > > continue to maintain our code up-to-date with the upstream kernel > > > while having an extra compatibility layer for older kernels in our > > > out-of-tree repository. > > > I understand that SRP and NVMEoF which are in the kernel already do > > > provide equivalent functionality for the majority of the use cases. > > > IBNBD on the other hand is showing higher performance and more > > > importantly includes the IBTRS - a general purpose library to > > > establish connections and transport BIO-like read/write sg-lists over > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > > it would make sense for us to rework our patchset and try pushing it > > > for staging tree first, so that we can proof IBNBD is well maintained, > > > beneficial for the eco-system, find a proper location for it within > > > block/rdma subsystems? This would make it easier for people to try it > > > out and would also be a huge step for us in terms of maintenance > > > effort. > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > > near future). Do you think it would make sense to rename the driver to > > > RNBD/RTRS? > > > > It is better to avoid "staging" tree, because it will lack attention of > > relevant people and your efforts will be lost once you will try to move > > out of staging. We are all remembering Lustre and don't want to see it > > again. > > That's up to the developers, that had nothing to do with the fact that > the code was in the staging tree. If the Lustre developers had actually > done the requested work, it would have moved out of the staging tree. > > So if these developers are willing to do the work to get something out > of staging, and into the "real" part of the kernel, I will gladly take > it. Thanks Greg, This is encouraging, we ARE willing to do the work to get IBNBD/IBTRS merged to upstream kernel. We regularly contribute to stable kernel also upsteam, backport patches, testing stable rc release etc. We believe in opensource and the power of community. Sure, we will try to go with so called real kernel, this is also what we are doing and did in the past, but since v3, we did not receive any real feedback. We will see how thing will go. Thanks again! Jack Wang @ 1 & 1 IONOS Cloud GmbH > > But I will note that it is almost always easier to just do the work > ahead of time, and merge it in "correctly" than to go from staging into > the real part of the kernel. But it's up to the developers what they > want to do. > > thanks, > > greg k-h
On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > Could you please provide some feedback to the IBNBD driver and the > IBTRS library? From my perspective you need to get people from the block community to go over this. It is the merge window right now, so nobody is really looking at patches; you may need to resend it after rc1 to get attention. Jason
On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote: > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道: > > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > > > Could you please provide some feedback to the IBNBD driver and the > > > IBTRS library? > > > So far we addressed all the requests provided by the community and > > > continue to maintain our code up-to-date with the upstream kernel > > > while having an extra compatibility layer for older kernels in our > > > out-of-tree repository. > > > I understand that SRP and NVMEoF which are in the kernel already do > > > provide equivalent functionality for the majority of the use cases. > > > IBNBD on the other hand is showing higher performance and more > > > importantly includes the IBTRS - a general purpose library to > > > establish connections and transport BIO-like read/write sg-lists over > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > > it would make sense for us to rework our patchset and try pushing it > > > for staging tree first, so that we can proof IBNBD is well maintained, > > > beneficial for the eco-system, find a proper location for it within > > > block/rdma subsystems? This would make it easier for people to try it > > > out and would also be a huge step for us in terms of maintenance > > > effort. > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > > near future). Do you think it would make sense to rename the driver to > > > RNBD/RTRS? > > > > It is better to avoid "staging" tree, because it will lack attention of > > relevant people and your efforts will be lost once you will try to move > > out of staging. We are all remembering Lustre and don't want to see it > > again. > > > > Back then, you was asked to provide support for performance superiority. > > Can you please share any numbers with us? > Hi Leon, > > Thanks for you feedback. > > For performance numbers, Danil did intensive benchmark, and create > some PDF with graphes here: > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3 > > It includes both single path results also different multipath policy results. > > If you have any question regarding the results, please let us know. I kind of recall that last time the perf numbers were skewed toward IBNBD because the invalidation model for MR was wrong - did this get fixed? Jason
On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote: > > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道: > > > > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > > > > > Could you please provide some feedback to the IBNBD driver and the > > > > IBTRS library? > > > > So far we addressed all the requests provided by the community and > > > > continue to maintain our code up-to-date with the upstream kernel > > > > while having an extra compatibility layer for older kernels in our > > > > out-of-tree repository. > > > > I understand that SRP and NVMEoF which are in the kernel already do > > > > provide equivalent functionality for the majority of the use cases. > > > > IBNBD on the other hand is showing higher performance and more > > > > importantly includes the IBTRS - a general purpose library to > > > > establish connections and transport BIO-like read/write sg-lists over > > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > > > it would make sense for us to rework our patchset and try pushing it > > > > for staging tree first, so that we can proof IBNBD is well maintained, > > > > beneficial for the eco-system, find a proper location for it within > > > > block/rdma subsystems? This would make it easier for people to try it > > > > out and would also be a huge step for us in terms of maintenance > > > > effort. > > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > > > near future). Do you think it would make sense to rename the driver to > > > > RNBD/RTRS? > > > > > > It is better to avoid "staging" tree, because it will lack attention of > > > relevant people and your efforts will be lost once you will try to move > > > out of staging. We are all remembering Lustre and don't want to see it > > > again. > > > > > > Back then, you was asked to provide support for performance superiority. > > > Can you please share any numbers with us? > > Hi Leon, > > > > Thanks for you feedback. > > > > For performance numbers, Danil did intensive benchmark, and create > > some PDF with graphes here: > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3 > > > > It includes both single path results also different multipath policy results. > > > > If you have any question regarding the results, please let us know. > > I kind of recall that last time the perf numbers were skewed toward > IBNBD because the invalidation model for MR was wrong - did this get > fixed? > > Jason Thanks Jason for feedback. Can you be more specific about "the invalidation model for MR was wrong" I checked in the history of the email thread, only found "I think from the RDMA side, before we accept something like this, I'd like to hear from Christoph, Chuck or Sagi that the dataplane implementation of this is correct, eg it uses the MRs properly and invalidates at the right time, sequences with dma_ops as required, etc. " And no reply from any of you since then. Thanks, Jack
On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote: > On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote: > > > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道: > > > > > > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > > > > > > > Could you please provide some feedback to the IBNBD driver and the > > > > > IBTRS library? > > > > > So far we addressed all the requests provided by the community and > > > > > continue to maintain our code up-to-date with the upstream kernel > > > > > while having an extra compatibility layer for older kernels in our > > > > > out-of-tree repository. > > > > > I understand that SRP and NVMEoF which are in the kernel already do > > > > > provide equivalent functionality for the majority of the use cases. > > > > > IBNBD on the other hand is showing higher performance and more > > > > > importantly includes the IBTRS - a general purpose library to > > > > > establish connections and transport BIO-like read/write sg-lists over > > > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > > > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > > > > it would make sense for us to rework our patchset and try pushing it > > > > > for staging tree first, so that we can proof IBNBD is well maintained, > > > > > beneficial for the eco-system, find a proper location for it within > > > > > block/rdma subsystems? This would make it easier for people to try it > > > > > out and would also be a huge step for us in terms of maintenance > > > > > effort. > > > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > > > > near future). Do you think it would make sense to rename the driver to > > > > > RNBD/RTRS? > > > > > > > > It is better to avoid "staging" tree, because it will lack attention of > > > > relevant people and your efforts will be lost once you will try to move > > > > out of staging. We are all remembering Lustre and don't want to see it > > > > again. > > > > > > > > Back then, you was asked to provide support for performance superiority. > > > > Can you please share any numbers with us? > > > Hi Leon, > > > > > > Thanks for you feedback. > > > > > > For performance numbers, Danil did intensive benchmark, and create > > > some PDF with graphes here: > > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3 > > > > > > It includes both single path results also different multipath policy results. > > > > > > If you have any question regarding the results, please let us know. > > > > I kind of recall that last time the perf numbers were skewed toward > > IBNBD because the invalidation model for MR was wrong - did this get > > fixed? > > > > Jason > > Thanks Jason for feedback. > Can you be more specific about "the invalidation model for MR was wrong" MR's must be invalidated before data is handed over to the block layer. It can't leave MRs open for access and then touch the memory the MR covers. IMHO this is the most likely explanation for any performance difference from nvme.. 
> I checked in the history of the email thread, only found > "I think from the RDMA side, before we accept something like this, I'd > like to hear from Christoph, Chuck or Sagi that the dataplane > implementation of this is correct, eg it uses the MRs properly and > invalidates at the right time, sequences with dma_ops as required, > etc. > " > And no reply from any of you since then. This task still needs to happen.. Jason
On Tue, Jul 09, 2019 at 01:17:37PM +0200, Greg KH wrote: > On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote: > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > > > Could you please provide some feedback to the IBNBD driver and the > > > IBTRS library? > > > So far we addressed all the requests provided by the community and > > > continue to maintain our code up-to-date with the upstream kernel > > > while having an extra compatibility layer for older kernels in our > > > out-of-tree repository. > > > I understand that SRP and NVMEoF which are in the kernel already do > > > provide equivalent functionality for the majority of the use cases. > > > IBNBD on the other hand is showing higher performance and more > > > importantly includes the IBTRS - a general purpose library to > > > establish connections and transport BIO-like read/write sg-lists over > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > > it would make sense for us to rework our patchset and try pushing it > > > for staging tree first, so that we can proof IBNBD is well maintained, > > > beneficial for the eco-system, find a proper location for it within > > > block/rdma subsystems? This would make it easier for people to try it > > > out and would also be a huge step for us in terms of maintenance > > > effort. > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > > near future). Do you think it would make sense to rename the driver to > > > RNBD/RTRS? > > > > It is better to avoid "staging" tree, because it will lack attention of > > relevant people and your efforts will be lost once you will try to move > > out of staging. We are all remembering Lustre and don't want to see it > > again. > > That's up to the developers, that had nothing to do with the fact that > the code was in the staging tree. If the Lustre developers had actually > done the requested work, it would have moved out of the staging tree. > > So if these developers are willing to do the work to get something out > of staging, and into the "real" part of the kernel, I will gladly take > it. Greg, It is not matter of how much *real* work developers will do, but it is a matter of guidance to do *right* thing, which is hard to achieve if people mentioned in the beginning of this thread wouldn't look on staging code. > > But I will note that it is almost always easier to just do the work > ahead of time, and merge it in "correctly" than to go from staging into > the real part of the kernel. But it's up to the developers what they > want to do. > > thanks, > > greg k-h
On Tue, Jul 9, 2019 at 3:19 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote: > > On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > > > > > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote: > > > > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道: > > > > > > > > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > > > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > > > > > > > > > Could you please provide some feedback to the IBNBD driver and the > > > > > > IBTRS library? > > > > > > So far we addressed all the requests provided by the community and > > > > > > continue to maintain our code up-to-date with the upstream kernel > > > > > > while having an extra compatibility layer for older kernels in our > > > > > > out-of-tree repository. > > > > > > I understand that SRP and NVMEoF which are in the kernel already do > > > > > > provide equivalent functionality for the majority of the use cases. > > > > > > IBNBD on the other hand is showing higher performance and more > > > > > > importantly includes the IBTRS - a general purpose library to > > > > > > establish connections and transport BIO-like read/write sg-lists over > > > > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > > > > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > > > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > > > > > it would make sense for us to rework our patchset and try pushing it > > > > > > for staging tree first, so that we can proof IBNBD is well maintained, > > > > > > beneficial for the eco-system, find a proper location for it within > > > > > > block/rdma subsystems? This would make it easier for people to try it > > > > > > out and would also be a huge step for us in terms of maintenance > > > > > > effort. > > > > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > > > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > > > > > near future). Do you think it would make sense to rename the driver to > > > > > > RNBD/RTRS? > > > > > > > > > > It is better to avoid "staging" tree, because it will lack attention of > > > > > relevant people and your efforts will be lost once you will try to move > > > > > out of staging. We are all remembering Lustre and don't want to see it > > > > > again. > > > > > > > > > > Back then, you was asked to provide support for performance superiority. > > > > > Can you please share any numbers with us? > > > > Hi Leon, > > > > > > > > Thanks for you feedback. > > > > > > > > For performance numbers, Danil did intensive benchmark, and create > > > > some PDF with graphes here: > > > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3 > > > > > > > > It includes both single path results also different multipath policy results. > > > > > > > > If you have any question regarding the results, please let us know. > > > > > > I kind of recall that last time the perf numbers were skewed toward > > > IBNBD because the invalidation model for MR was wrong - did this get > > > fixed? > > > > > > Jason > > > > Thanks Jason for feedback. > > Can you be more specific about "the invalidation model for MR was wrong" > > MR's must be invalidated before data is handed over to the block > layer. It can't leave MRs open for access and then touch the memory > the MR covers. 
> > IMHO this is the most likely explanation for any performance difference > from nvme.. > > > I checked in the history of the email thread, only found > > "I think from the RDMA side, before we accept something like this, I'd > > like to hear from Christoph, Chuck or Sagi that the dataplane > > implementation of this is correct, eg it uses the MRs properly and > > invalidates at the right time, sequences with dma_ops as required, > > etc. > > " > > And no reply from any of you since then. > > This task still needs to happen.. > > Jason We did extensive testing and cross-checked how iser and nvmeof do MR invalidation, and didn't find a problem. + Chuck It would be appreciated if Christoph, Chuck, Sagi or Bart could take a look, thank you in advance. Thanks Jack
On 7/9/19 4:17 AM, Greg KH wrote: > So if these developers are willing to do the work to get something out > of staging, and into the "real" part of the kernel, I will gladly take > it. Linus once famously said "given enough eyeballs, all bugs are shallow". There are already two block-over-RDMA driver pairs upstream (NVMeOF and SRP). Accepting the IBTRS and IBNBD drivers upstream would reduce the number of users of the upstream block-over-RDMA drivers and hence would fragment the block-over-RDMA driver user base further. Additionally, I'm not yet convinced that the interesting parts of IBNBD cannot be integrated into the existing upstream drivers. So it's not clear to me whether taking the IBTRS and IBNBD drivers upstream would help the Linux user community. Bart.
Hi Danil and Jack, > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > Could you please provide some feedback to the IBNBD driver and the > IBTRS library? > So far we addressed all the requests provided by the community That is not exactly correct AFAIR. My main issues, which were raised before, are: - IMO there isn't any justification for this ibtrs layering separation given that the only user of this is your ibnbd. Unless you are trying to submit another consumer, you should avoid adding another subsystem that is not really general purpose. - ibtrs in general is using almost no infrastructure from the existing kernel subsystems. Examples are: - tag allocation mechanism (which I'm not clear why it's needed) - rdma rw abstraction similar to what we have in the core - list_next_or_null_rr_rcu ?? - a few other examples sprinkled around.. Another question, from what I understand from the code, the client always rdma_writes data on writes (with imm) from a remote pool of server buffers dedicated to it. Essentially all writes are immediate (no rdma reads ever). How is that different than using send wrs to a set of pre-posted recv buffers (like all others are doing)? Is it faster? Also, given that the server pre-allocates a substantial amount of memory for each connection, are the requirements on the server side documented? Usually kernel implementations (especially upstream ones) will avoid imposing such large longstanding memory requirements on the system by default. I don't have a firm stand on this, but wanted to highlight this as you are sending this for upstream inclusion. and > continue to maintain our code up-to-date with the upstream kernel > while having an extra compatibility layer for older kernels in our > out-of-tree repository. Overall, while I absolutely support your cause to lower your maintenance overhead by having this sit upstream, I don't see why this can be helpful to anyone else in the rdma community. If instead you can crystallize why/how ibnbd is faster than anything else, and perhaps contribute a common infrastructure piece (or enhance an existing one) that other existing ulps can leverage, it will be a lot more compelling to include it upstream. > I understand that SRP and NVMEoF which are in the kernel already do > provide equivalent functionality for the majority of the use cases. > IBNBD on the other hand is showing higher performance and more > importantly includes the IBTRS - a general purpose library to > establish connections and transport BIO-like read/write sg-lists over > RDMA, But who needs it? Can other ulps use it or pieces of it? I keep failing to understand why this is a benefit if it's specific to your ibnbd?
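For context on the tag mechanism questioned above: from the rest of the thread it appears to amount to a counted pool over the queue_depth pre-allocated server buffers, shared by all devices and hardware queues of one session. A minimal sketch, assuming hypothetical names and not reflecting the actual ibtrs code, of how such a shared pool could lean on the kernel's existing sbitmap_queue infrastructure instead of being open-coded:

#include <linux/sbitmap.h>
#include <linux/smp.h>
#include <linux/numa.h>
#include <linux/gfp.h>

/* Hypothetical per-session tag pool; not the ibtrs implementation. */
struct example_sess_tags {
        struct sbitmap_queue bitmap;    /* one bit per server-side buffer */
};

static int example_tags_init(struct example_sess_tags *t, unsigned int queue_depth)
{
        /*
         * round_robin = true hands out bits in strict round-robin order
         * instead of preferring the most recently freed one, which trades
         * cache locality for fairness between users of the pool.
         */
        return sbitmap_queue_init_node(&t->bitmap, queue_depth, -1, true,
                                       GFP_KERNEL, NUMA_NO_NODE);
}

/* Returns a tag (buffer index), or a negative value if the budget is exhausted. */
static int example_tag_get(struct example_sess_tags *t)
{
        return __sbitmap_queue_get(&t->bitmap);
}

static void example_tag_put(struct example_sess_tags *t, unsigned int tag)
{
        sbitmap_queue_clear(&t->bitmap, tag, raw_smp_processor_id());
}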
>> Thanks Jason for feedback. >> Can you be more specific about "the invalidation model for MR was wrong" > > MR's must be invalidated before data is handed over to the block > layer. It can't leave MRs open for access and then touch the memory > the MR covers. Jason is referring to these fixes: 2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing a request") 4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work request has completed") b4b591c87f2b ("nvme-rdma: don't suppress send completions")
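The ordering rule behind the fixes cited above: an IO that used a registered MR may only be completed to the block layer after the local invalidation of that MR (and the send WR carrying the request) has itself completed, so the HCA can no longer touch the pages. The condensed sketch below is modeled on that idea only; it is not copied from nvme-rdma, and all names are hypothetical:

#include <rdma/ib_verbs.h>
#include <linux/atomic.h>
#include <linux/kernel.h>

/*
 * Hypothetical request tracking, for illustration only: the request is
 * finished once BOTH the response has arrived AND the MR local
 * invalidation has completed. req->pending is set to 2 at submission.
 */
struct example_req {
        struct ib_cqe   inv_cqe;
        atomic_t        pending;        /* response + invalidation */
        void            (*complete)(struct example_req *req);
};

static void example_put(struct example_req *req)
{
        if (atomic_dec_and_test(&req->pending))
                req->complete(req);     /* now safe to hand data to the block layer */
}

static void example_inv_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct example_req *req = container_of(wc->wr_cqe,
                                               struct example_req, inv_cqe);

        example_put(req);
}

/* Called when the response for the request has been received. */
static int example_response_received(struct ib_qp *qp, struct example_req *req,
                                     u32 rkey)
{
        struct ib_send_wr wr = {};
        const struct ib_send_wr *bad_wr;
        int ret;

        req->inv_cqe.done     = example_inv_done;
        wr.opcode             = IB_WR_LOCAL_INV;
        wr.send_flags         = IB_SEND_SIGNALED;
        wr.ex.invalidate_rkey = rkey;
        wr.wr_cqe             = &req->inv_cqe;

        ret = ib_post_send(qp, &wr, &bad_wr);
        if (!ret)
                example_put(req);       /* drop the "response arrived" reference */
        return ret;
}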
On Tue, Jul 09, 2019 at 12:45:57PM -0700, Sagi Grimberg wrote: > Another question, from what I understand from the code, the client > always rdma_writes data on writes (with imm) from a remote pool of > server buffers dedicated to it. Essentially all writes are immediate (no > rdma reads ever). How is that different than using send wrs to a set of > pre-posted recv buffers (like all others are doing)? Is it faster? RDMA WRITE only is generally a bit faster, and if you use a buffer pool in a smart way it is possible to get very good data packing. With SEND the number of recvq entries dictates how big the rx buffer can be, or you waste even more memory by using partial send buffers.. A scheme like this seems like a high-performance idea, but on the other hand, I have no idea how you could possibly manage invalidations efficiently with a shared RX buffer pool... The RXer has to push out an invalidation for the shared buffer pool MR, but we don't have protocols for partial MR invalidation. Which is back to my earlier thought that the main reason this performs better is because it doesn't have synchronous MR invalidation. Maybe this is fine, but it needs to be made very clear that it uses this insecure operating model to get higher performance.. Jason
Hi Leon, thanks for the feedback! On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote: > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > Could you please provide some feedback to the IBNBD driver and the > > IBTRS library? > > So far we addressed all the requests provided by the community and > > continue to maintain our code up-to-date with the upstream kernel > > while having an extra compatibility layer for older kernels in our > > out-of-tree repository. > > I understand that SRP and NVMEoF which are in the kernel already do > > provide equivalent functionality for the majority of the use cases. > > IBNBD on the other hand is showing higher performance and more > > importantly includes the IBTRS - a general purpose library to > > establish connections and transport BIO-like read/write sg-lists over > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While > > I believe IBNBD does meet the kernel coding standards, it doesn't have > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think > > it would make sense for us to rework our patchset and try pushing it > > for staging tree first, so that we can proof IBNBD is well maintained, > > beneficial for the eco-system, find a proper location for it within > > block/rdma subsystems? This would make it easier for people to try it > > out and would also be a huge step for us in terms of maintenance > > effort. > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the > > near future). Do you think it would make sense to rename the driver to > > RNBD/RTRS? > > It is better to avoid "staging" tree, because it will lack attention of > relevant people and your efforts will be lost once you will try to move > out of staging. We are all remembering Lustre and don't want to see it > again. > > Back then, you was asked to provide support for performance superiority. I have only theories of why ibnbd is showing better numbers than nvmeof: 1. The way we utilize the MQ framework in IBNBD. We promise to have queue_depth (say 512) requests on each of the num_cpus hardware queues of each device, but in fact we have only queue_depth for the whole "session" toward a given server. The moment we have queue_depth inflights, we need to stop the queue (on a device on a cpu) on which we get more requests. We need to start them again after some requests are completed. We maintain per-cpu lists of stopped HW queues, a bitmap showing which lists are not empty, etc., to wake them up in a round-robin fashion to avoid starvation of any devices. 2. We only do rdma writes with imm. A server reserves queue_depth buffers of max_io_size each for a given client. The client manages those itself. The client uses the imm field to tell the server which buffer has been written (and where), and the server uses the imm field to send back the errno. If our max_io_size is 64K, queue_depth is 512 and the client only issues 4K IOs all the time, then 60K*512 (about 30M) of memory is wasted. On the other hand, we do no buffer allocation/registration in the IO path on the server side. The server sends rdma addresses and keys for those preregistered buffers on connection establishment and deallocates/unregisters them when a session is closed. That's for writes. For reads, the client registers the user buffers (after FR) and sends the addresses and keys to the server (with an rdma write with imm). The server rdma writes into those buffers.
The client does the unregistering/invalidation and completes the request. > Can you please share any numbers with us? Apart from github (https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3), the performance results for v5.2-rc3 on two different systems can be accessed under dcd.ionos.com/ibnbd-performance-report. The page allows filtering out the test scenarios interesting for comparison. > > Thanks
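To make the immediate-data usage described above concrete: the 32-bit imm field is the only metadata that travels with each message, so it has to identify the server buffer in the request direction and carry the status in the response direction. The bit layout below is purely illustrative; the actual ibtrs wire format is not spelled out in this thread:

#include <linux/types.h>

/*
 * Hypothetical 32-bit immediate layout, for illustration only:
 * client -> server: index of the pre-registered server buffer just written;
 * server -> client: the same buffer index plus the errno of the finished IO.
 */
#define EX_IMM_BUF_BITS 16
#define EX_IMM_BUF_MASK ((1u << EX_IMM_BUF_BITS) - 1)

static inline u32 ex_imm_make_req(u16 buf_index)
{
        return buf_index;                       /* upper bits reserved */
}

static inline u32 ex_imm_make_resp(u16 buf_index, int error)
{
        /* error is 0 or a negative errno; store its magnitude */
        return ((u32)(-error) << EX_IMM_BUF_BITS) | buf_index;
}

static inline void ex_imm_parse_resp(u32 imm, u16 *buf_index, int *error)
{
        *buf_index = imm & EX_IMM_BUF_MASK;
        *error = -(int)(imm >> EX_IMM_BUF_BITS);
}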
>> Another question, from what I understand from the code, the client >> always rdma_writes data on writes (with imm) from a remote pool of >> server buffers dedicated to it. Essentially all writes are immediate (no >> rdma reads ever). How is that different than using send wrs to a set of >> pre-posted recv buffers (like all others are doing)? Is it faster? > > RDMA WRITE only is generally a bit faster, and if you use a buffer > pool in a smart way it is possible to get very good data packing. There is no packing, it's used exactly like send/recv, but with a remote buffer pool (a pool of 512K buffers): the client selects one and does an rdma write with imm to it. > With > SEND the number of recvq entries dictates how big the rx buffer can > be, or you waste even more memory by using partial send buffers.. This is exactly how it is used here. > A scheme like this seems like a high performance idea, but on the > other side, I have no idea how you could possibly manage invalidations > efficiently with a shared RX buffer pool... There are no invalidations, this remote server pool is registered once and stays alive for the lifetime of the session. > The RXer has to push out an invalidation for the shared buffer pool > MR, but we don't have protocols for partial MR invalidation. > > Which is back to my earlier thought that the main reason this perfoms > better is because it doesn't have synchronous MR invalidation. This issue only exists on the client side. The server never invalidates any of its buffers. > Maybe this is fine, but it needs to be made very clear that it uses > this insecure operating model to get higher performance.. I still do not understand why this should give any noticeable performance advantage.
On Wed, Jul 10, 2019 at 09:25:05AM -0700, Sagi Grimberg wrote: > > > > Another question, from what I understand from the code, the client > > > always rdma_writes data on writes (with imm) from a remote pool of > > > server buffers dedicated to it. Essentially all writes are immediate (no > > > rdma reads ever). How is that different than using send wrs to a set of > > > pre-posted recv buffers (like all others are doing)? Is it faster? > > > > RDMA WRITE only is generally a bit faster, and if you use a buffer > > pool in a smart way it is possible to get very good data packing. > > There is no packing, its used exactly as send/recv, but with a remote > buffer pool (pool of 512K buffers) and the client selects one and rdma > write with imm to it. Well that makes little sense then:) > > Maybe this is fine, but it needs to be made very clear that it uses > > this insecure operating model to get higher performance.. > > I still do not understand why this should give any notice-able > performance advantage. Usually omitting invalidations gives a healthy bump. Also, RDMA WRITE is generally faster than READ at the HW level in various ways. Jason
>> I still do not understand why this should give any notice-able >> performance advantage. > > Usually omitting invalidations gives a healthy bump. > > Also, RDMA WRITE is generally faster than READ at the HW level in > various ways. Yes, but this should be essentially identical to running nvme-rdma with 512KB of immediate-data (the nvme term is in-capsule data). In the upstream nvme target we have inline_data_size port attribute that is tunable for that (defaults to PAGE_SIZE).
Hi Sagi, thanks a lot for the analysis. I didn't know about the inline_data_size parameter in nvmet. It is at PAGE_SIZE on our systems. We will rerun our benchmarks with echo 2097152 > /sys/kernel/config/nvmet/ports/1/param_inline_data_size echo 2097152 > /sys/kernel/config/nvmet/ports/2/param_inline_data_size before enabling the ports. Best Danil. On Wed, Jul 10, 2019 at 9:11 PM Sagi Grimberg <sagi@grimberg.me> wrote: > > > >> I still do not understand why this should give any notice-able > >> performance advantage. > > > > Usually omitting invalidations gives a healthy bump. > > > > Also, RDMA WRITE is generally faster than READ at the HW level in > > various ways. > > Yes, but this should be essentially identical to running nvme-rdma > with 512KB of immediate-data (the nvme term is in-capsule data). > > In the upstream nvme target we have inline_data_size port attribute > that is tunable for that (defaults to PAGE_SIZE).
Hi Sagi, thanks a lot for the detailed reply. Answers inline below: On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg <sagi@grimberg.me> wrote: > > Hi Danil and Jack, > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg, > > > > Could you please provide some feedback to the IBNBD driver and the > > IBTRS library? > > So far we addressed all the requests provided by the community > > That is not exactly correct AFAIR, > > My main issues which were raised before are: > - IMO there isn't any justification to this ibtrs layering separation > given that the only user of this is your ibnbd. Unless you are > trying to submit another consumer, you should avoid adding another > subsystem that is not really general purpose. We designed ibtrs not only with IBNBD in mind but also as the transport layer for a distributed SDS. We'd like to be able to do what ceph is capable of (automatic up/down scaling of the storage cluster, automatic recovery) but using in-kernel rdma-based IO transport drivers, thin-provisioned volume managers, etc. to keep the highest possible performance. That modest plan of ours should, among other things, cover the following: When using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned volumes) from server to client and building an (md-)raid on top of the imported devices on the client side in order to provide redundancy across different machines, one gets very decent throughput and low latency, since the IOs are sent in parallel to the storage machines. One downside of this setup is that the resync traffic has to flow over the client, where the md-raid is sitting. Ideally the resync traffic should flow directly between the two "legs" (storage machines) of the raid. The server side of such a "distributed raid" capable of this direct syncing between the array members would necessarily require some logic on the server side and hence could also sit on top of ibtrs. (To continue the analogy, the "server" side of an md-raid built on top of, say, two NVMEoF devices is just two block devices, which cannot communicate with each other.) All in all, ibtrs is a library to establish a "fat", multipath, autoreconnectable connection between two hosts on top of rdma, optimized for transport of IO traffic. > - ibtrs in general is using almost no infrastructure from the existing > kernel subsystems. Examples are: > - tag allocation mechanism (which I'm not clear why its needed) As you correctly noticed, our client manages the buffers allocated and registered by the server on connection establishment. Our tags are just a mechanism to take and release those buffers for incoming requests on the client side. Since the buffers allocated by the server are to be shared between all the devices mapped from that server and all their HW queues (each having num_cpus of them), the mechanism behind get_tag/put_tag also takes care of the fairness. > - rdma rw abstraction similar to what we have in the core On the one hand we have only a single IO-related function: ibtrs_clt_request(READ/WRITE, session,...), which executes an rdma write with imm, or requests an rdma write with imm to be executed by the server. On the other hand we provide an abstraction to establish and manage what we call a "session", which consists of multiple paths (to do failover and multipath with different policies), where each path consists of num_cpu rdma connections. Once you have established a session you can add or remove paths from it on the fly.
In case the connection to the server is lost, the client periodically attempts to reconnect automatically. On the server side you get just sg-lists with a direction READ or WRITE as requested by the client. We designed this interface not only as the minimum required to build a block device on top of rdma but also with a distributed raid in mind. > - list_next_or_null_rr_rcu ?? We use that for multipath. The macro (and more importantly the way we use it) has been reviewed by Linus and quite closely by Paul E. McKenney. AFAIR the conclusion was that Roman's implementation is correct, but too tricky to use correctly in order to be included into the kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659 > - few other examples sprinkled around.. To my best knowledge we addressed everything we got comments on and will definitely do so in the future. > Another question, from what I understand from the code, the client > always rdma_writes data on writes (with imm) from a remote pool of > server buffers dedicated to it. Essentially all writes are immediate (no > rdma reads ever). How is that different than using send wrs to a set of > pre-posted recv buffers (like all others are doing)? Is it faster? At the very beginning of the project we did some measurements and saw that it is faster. I'm not sure if this is still true, since the hardware, the drivers and the rdma subsystem have changed since then. Also it seemed to make the code simpler. > Also, given that the server pre-allocate a substantial amount of memory > for each connection, is it documented the requirements from the server > side? Usually kernel implementations (especially upstream ones) will > avoid imposing such large longstanding memory requirements on the system > by default. I don't have a firm stand on this, but wanted to highlight > this as you are sending this for upstream inclusion. We definitely need to stress that somewhere. We will include it in the README and add it to the cover letter next time. Our memory management is indeed basically absent in favor of performance: the server reserves queue_depth of, say, 512K buffers. Each buffer is used by the client for a single IO only, no matter how big the request is. So if the client only issues 4K IOs, we do waste 508K*queue_depth of memory. We were aiming for the lowest possible latency from the beginning. It is probably possible to implement some clever allocator on the server side which wouldn't affect the performance a lot. > > and > > continue to maintain our code up-to-date with the upstream kernel > > while having an extra compatibility layer for older kernels in our > > out-of-tree repository. > > Overall, while I absolutely support your cause to lower your maintenance > overhead by having this sit upstream, I don't see why this can be > helpful to anyone else in the rdma community. If instead you can > crystallize why/how ibnbd is faster than anything else, and perhaps > contribute a common infrastructure piece (or enhance an existing one) > such that other existing ulps can leverage, it will be a lot more > compelling to include it upstream. > > > I understand that SRP and NVMEoF which are in the kernel already do > > provide equivalent functionality for the majority of the use cases. > > IBNBD on the other hand is showing higher performance and more > > importantly includes the IBTRS - a general purpose library to > > establish connections and transport BIO-like read/write sg-lists over > > RDMA, > > But who needs it? Can other ulps use it or pieces of it?
> I keep failing to understand why this is a benefit if it's specific to
> your ibnbd?

See above, and please ask if you have more questions on this.

Thank you,
Danil.
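For illustration, the interface shape described above boils down to roughly
the following sketch, as a hypothetical second consumer would see it. Every
name and signature here is invented for this example (the text only mentions
ibtrs_clt_request(READ/WRITE, session, ...)); this is not the actual ibtrs
API:

/*
 * Illustration only: a session-based transport interface as a hypothetical
 * second consumer would use it.  All names/signatures are made up here.
 */
#include <linux/scatterlist.h>
#include <linux/socket.h>

enum xfer_dir { XFER_READ, XFER_WRITE };

/* opaque session: several paths, each path holding num_cpus RDMA connections */
struct xfer_session;

/* establish a multipath session; paths can be added/removed later on the fly */
struct xfer_session *xfer_open_session(const char *sessname,
				       const struct sockaddr_storage *paths,
				       size_t path_cnt);

/*
 * Hand over an sg-list plus a direction; the completion callback fires once
 * the server has confirmed the write or delivered the requested read data.
 */
int xfer_request(enum xfer_dir dir, struct xfer_session *sess,
		 void (*conf)(void *priv, int errno), void *priv,
		 struct scatterlist *sg, unsigned int sg_cnt);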
>> My main issues which were raised before are:
>> - IMO there isn't any justification to this ibtrs layering separation
>>   given that the only user of this is your ibnbd. Unless you are
>>   trying to submit another consumer, you should avoid adding another
>>   subsystem that is not really general purpose.
> We designed ibtrs not only with IBNBD in mind but also as the
> transport layer for a distributed SDS. We'd like to be able to do what
> ceph is capable of (automatic up/down scaling of the storage cluster,
> automatic recovery) but using in-kernel rdma-based IO transport
> drivers, thin-provisioned volume managers, etc. to keep the highest
> possible performance.

Sounds lovely, but still very much bound to your ibnbd. And that part
is not included in the patch set, so I still don't see why this should
be considered a "generic" transport subsystem (it clearly isn't).

> All in all, ibtrs is a library to establish a "fat", multipath,
> autoreconnectable connection between two hosts on top of rdma,
> optimized for transport of IO traffic.

That is also dictating a wire-protocol which makes it useless to pretty
much any other consumer. Personally, I don't see how this library
would ever be used outside of your ibnbd.

>> - ibtrs in general is using almost no infrastructure from the existing
>>   kernel subsystems. Examples are:
>>   - tag allocation mechanism (which I'm not clear why its needed)
> As you correctly noticed, our client manages the buffers allocated and
> registered by the server on connection establishment. Our tags are
> just a mechanism to take and release those buffers for incoming
> requests on the client side. Since the buffers allocated by the server are
> to be shared between all the devices mapped from that server and all
> their HW queues (each having num_cpus of them), the mechanism behind
> get_tag/put_tag also takes care of fairness.

We have infrastructure for this: sbitmaps.

>> - rdma rw abstraction similar to what we have in the core
> On the one hand we have only a single IO-related function:
> ibtrs_clt_request(READ/WRITE, session, ...), which executes an rdma write
> with imm, or requests an rdma write with imm to be executed by the
> server.

For sure you can enhance the rw API to have imm support?

> On the other hand we provide an abstraction to establish and
> manage what we call a "session", which consists of multiple paths (to do
> failover and multipath with different policies), where each path
> consists of num_cpu rdma connections.

That's fine, but it doesn't mean that it also needs to re-write
infrastructure that we already have.

> Once you have established a session
> you can add or remove paths from it on the fly. In case the connection
> to the server is lost, the client periodically attempts to reconnect
> automatically. On the server side you get just sg-lists with a
> direction READ or WRITE as requested by the client. We designed this
> interface not only as the minimum required to build a block device on
> top of rdma but also with a distributed raid in mind.

I suggest you take a look at the rw API and use that in your transport.

>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> At the very beginning of the project we did some measurements and saw
> that it is faster. I'm not sure if this is still true

It's not significantly faster (can't imagine why it would be).
What could make a difference is probably the fact that you never
do rdma reads for I/O writes which might be better. Also perhaps the
fact that you normally don't wait for send completions before completing
I/O (which is broken), and the fact that you batch recv operations.

I would be interested to understand what indeed makes ibnbd run faster
though.

>> Also, given that the server pre-allocates a substantial amount of memory
>> for each connection, are the requirements on the server side documented?
>> Usually kernel implementations (especially upstream ones) will
>> avoid imposing such large longstanding memory requirements on the system
>> by default. I don't have a firm stand on this, but wanted to highlight
>> this as you are sending this for upstream inclusion.
> We definitely need to stress that somewhere. We will include it in the
> README and add it to the cover letter next time. Our memory management is
> indeed basically absent in favor of performance: the server reserves
> queue_depth buffers of, say, 512K each. Each buffer is used by the client
> for a single IO only, no matter how big the request is. So if the client
> only issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> for the lowest possible latency from the beginning. It is probably possible
> to implement some clever allocator on the server side which wouldn't affect
> the performance a lot.

Or you can fall back to rdma_read like the rest of the ulps.
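To make the sbitmap suggestion concrete, here is a minimal sketch of how the
shared server-buffer pool described earlier could sit on top of
sbitmap_queue. The structure and function names are invented for this
sketch, and the sbitmap calls should be checked against the current API;
this is not code from the patch set:

#include <linux/sbitmap.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/numa.h>

/* invented for this sketch: one pool shared by all devices and HW queues */
struct buf_pool_sketch {
	struct sbitmap_queue	free_tags;	/* one bit per server chunk  */
	void			**chunks;	/* queue_depth server buffers */
};

static int buf_pool_init(struct buf_pool_sketch *p, unsigned int queue_depth)
{
	p->chunks = kcalloc(queue_depth, sizeof(*p->chunks), GFP_KERNEL);
	if (!p->chunks)
		return -ENOMEM;
	/* round_robin spreads tags across users, similar to the fairness goal */
	return sbitmap_queue_init_node(&p->free_tags, queue_depth, -1, true,
				       GFP_KERNEL, NUMA_NO_NODE);
}

/* get_tag() equivalent: returns a chunk index, or -1 if the pool is exhausted
 * (a blocking caller would park itself on the sbitmap_queue wait queues) */
static int buf_pool_get(struct buf_pool_sketch *p)
{
	return __sbitmap_queue_get(&p->free_tags);
}

/* put_tag() equivalent: releases the chunk and wakes up any waiters */
static void buf_pool_put(struct buf_pool_sketch *p, unsigned int tag)
{
	sbitmap_queue_clear(&p->free_tags, tag, raw_smp_processor_id());
}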
Hi Sagi,

> >> Another question, from what I understand from the code, the client
> >> always rdma_writes data on writes (with imm) from a remote pool of
> >> server buffers dedicated to it. Essentially all writes are immediate (no
> >> rdma reads ever). How is that different than using send wrs to a set of
> >> pre-posted recv buffers (like all others are doing)? Is it faster?
> > At the very beginning of the project we did some measurements and saw
> > that it is faster. I'm not sure if this is still true
>
> It's not significantly faster (can't imagine why it would be).
> What could make a difference is probably the fact that you never
> do rdma reads for I/O writes which might be better. Also perhaps the
> fact that you normally don't wait for send completions before completing
> I/O (which is broken), and the fact that you batch recv operations.

I don't know how you came to the conclusion that we don't wait for send
completion before completing IO.

We do chain WRs on a successful read request from the server, see function
rdma_write_sg:

318 static int rdma_write_sg(struct ibtrs_srv_op *id)
319 {
320         struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
321         dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
322         struct ibtrs_srv *srv = sess->srv;
323         struct ib_send_wr inv_wr, imm_wr;
324         struct ib_rdma_wr *wr = NULL;
snip
333         need_inval = le16_to_cpu(id->rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
snip
357                 wr->wr.wr_cqe = &io_comp_cqe;
358                 wr->wr.sg_list = list;
359                 wr->wr.num_sge = 1;
360                 wr->remote_addr = le64_to_cpu(id->rd_msg->desc[i].addr);
361                 wr->rkey = le32_to_cpu(id->rd_msg->desc[i].key);
snip
368                 if (i < (sg_cnt - 1))
369                         wr->wr.next = &id->tx_wr[i + 1].wr;
370                 else if (need_inval)
371                         wr->wr.next = &inv_wr;
372                 else
373                         wr->wr.next = &imm_wr;
374
375                 wr->wr.opcode = IB_WR_RDMA_WRITE;
376                 wr->wr.ex.imm_data = 0;
377                 wr->wr.send_flags = 0;
snip
386         if (need_inval) {
387                 inv_wr.next = &imm_wr;
388                 inv_wr.wr_cqe = &io_comp_cqe;
389                 inv_wr.sg_list = NULL;
390                 inv_wr.num_sge = 0;
391                 inv_wr.opcode = IB_WR_SEND_WITH_INV;
392                 inv_wr.send_flags = 0;
393                 inv_wr.ex.invalidate_rkey = rkey;
394         }
395         imm_wr.next = NULL;
396         imm_wr.wr_cqe = &io_comp_cqe;
397         imm_wr.sg_list = NULL;
398         imm_wr.num_sge = 0;
399         imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
400         imm_wr.send_flags = flags;
401         imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
402                                                               0, need_inval));
403

When we need to do invalidation of remote memory, we chain the WRs together;
the last two are inv_wr and imm_wr. imm_wr being the last one is important:
since RC QPs are ordered, we know that when we receive
IB_WC_RECV_RDMA_WITH_IMM and w_inval is true, the hardware has already
finished its job of invalidating the MR. If the server fails to invalidate,
we do local invalidation ourselves and wait for its completion. On the
client side:

284 static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
285                               bool notify, bool can_wait)
286 {
287         struct ibtrs_clt_con *con = req->con;
288         struct ibtrs_clt_sess *sess;
289         struct ibtrs_clt *clt;
290         int err;
291
292         if (WARN_ON(!req->in_use))
293                 return;
294         if (WARN_ON(!req->con))
295                 return;
296         sess = to_clt_sess(con->c.sess);
297         clt = sess->clt;
298
299         if (req->sg_cnt) {
300                 if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
301                         /*
302                          * We are here to invalidate RDMA read requests
303                          * ourselves. In normal scenario server should
304                          * send INV for all requested RDMA reads, but
305                          * we are here, thus two things could happen:
306                          *
307                          * 1. this is failover, when errno != 0
308                          *    and can_wait == 1,
309                          *
310                          * 2. something totally bad happened and
311                          *    server forgot to send INV, so we
312                          *    should do that ourselves.
313                          */
314
315                         if (likely(can_wait)) {
316                                 req->need_inv_comp = true;
317                         } else {
318                                 /* This should be IO path, so always notify */
319                                 WARN_ON(!notify);
320                                 /* Save errno for INV callback */
321                                 req->inv_errno = errno;
322                         }
323
324                         err = ibtrs_inv_rkey(req);
325                         if (unlikely(err)) {
326                                 ibtrs_err(sess, "Send INV WR key=%#x: %d\n",
327                                           req->mr->rkey, err);
328                         } else if (likely(can_wait)) {
329                                 wait_for_completion(&req->inv_comp);
330                         } else {
331                                 /*
332                                  * Something went wrong, so request will be
333                                  * completed from INV callback.
334                                  */
335                                 WARN_ON_ONCE(1);
336
337                                 return;
338                         }
339                 }
340                 ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
341                                 req->sg_cnt, req->dir);
342         }
343         if (sess->stats.enable_rdma_lat)
344                 ibtrs_clt_update_rdma_lat(&sess->stats,
345                                 req->dir == DMA_FROM_DEVICE,
346                                 jiffies_to_msecs(jiffies - req->start_jiffies));
347         ibtrs_clt_decrease_inflight(&sess->stats);
348
349         req->in_use = false;
350         req->con = NULL;
351
352         if (notify)
353                 req->conf(req->priv, errno);
354 }

356 static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id,
357                            s16 errno, bool w_inval)
358 {
359         struct ibtrs_clt_io_req *req;
360
361         if (WARN_ON(msg_id >= sess->queue_depth))
362                 return;
363
364         req = &sess->reqs[msg_id];
365         /* Drop need_inv if server responsed with invalidation */
366         req->need_inv &= !w_inval;
367         complete_rdma_req(req, errno, true, false);
368 }

Hope this clears the doubt.

Regards,
Jack
On Fri, Jul 12, 2019 at 2:22 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> >> My main issues which were raised before are:
> >> - IMO there isn't any justification to this ibtrs layering separation
> >>   given that the only user of this is your ibnbd. Unless you are
> >>   trying to submit another consumer, you should avoid adding another
> >>   subsystem that is not really general purpose.
> > We designed ibtrs not only with IBNBD in mind but also as the
> > transport layer for a distributed SDS. We'd like to be able to do what
> > ceph is capable of (automatic up/down scaling of the storage cluster,
> > automatic recovery) but using in-kernel rdma-based IO transport
> > drivers, thin-provisioned volume managers, etc. to keep the highest
> > possible performance.
>
> Sounds lovely, but still very much bound to your ibnbd. And that part
> is not included in the patch set, so I still don't see why this should
> be considered a "generic" transport subsystem (it clearly isn't).

Having IBTRS sit on a storage node enables that node to communicate with
other storage nodes (forward requests, request reads from other nodes, e.g.
for sync traffic). IBTRS is generic in the sense that it removes the strict
separation into an initiator (converting BIOs into some hardware-specific
protocol messages) and a target (which forwards those messages to some local
device supporting that protocol). It appears less generic to me to talk SCSI
or NVME between storage nodes if some of them have SCSI disks, others NVME
disks or LVM volumes, or a mixed setup. IBTRS allows one to just send, or
request a read of, an sg-list between machines over rdma - the very minimum
required to transport a BIO.

It would indeed support our case for the library if we could propose at
least two users of it. We currently only have a very early-stage prototype
capable of organizing storage nodes in pools, multiplexing IO between
different nodes, etc. sitting on top of ibtrs; it's not functional yet. On
the other hand, ibnbd together with ibtrs alone already amounts to over
10000 lines.

> > All in all, ibtrs is a library to establish a "fat", multipath,
> > autoreconnectable connection between two hosts on top of rdma,
> > optimized for transport of IO traffic.
>
> That is also dictating a wire-protocol which makes it useless to pretty
> much any other consumer. Personally, I don't see how this library
> would ever be used outside of your ibnbd.

It's true, IBTRS also imposes a protocol for connection establishment and
the IO path. I think at least the IO part we did reduce to a bare minimum:

350 * Write *
351
352 1. When processing a write request client selects one of the memory chunks
353 on the server side and rdma writes there the user data, user header and the
354 IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
355 contains size of the user header. The client tells the server which chunk has
356 been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
357 using the IMM field.
358
359 2. When confirming a write request server sends an "empty" rdma message with
360 an immediate field. The 32 bit field is used to specify the outstanding
361 inflight IO and for the error code.
362
363 CLT                                                          SRV
364 usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM]
365 [IBTRS_IO_RSP_IMM]                        <----------------- (id + errno)
366
367 * Read *
368
369 1. When processing a read request client selects one of the memory chunks
370 on the server side and rdma writes there the user header and the
371 IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
372 the user header, flags (specifying if memory invalidation is necessary) and the
373 list of addresses along with keys for the data to be read into.
374
375 2. When confirming a read request server transfers the requested data first,
376 attaches an invalidation message if requested and finally an "empty" rdma
377 message with an immediate field. The 32 bit field is used to specify the
378 outstanding inflight IO and the error code.
379
380 CLT                                           SRV
381 usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM]
382 [IBTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
383 or in case client requested invalidation:
384 [IBTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

> >> - ibtrs in general is using almost no infrastructure from the existing
> >>   kernel subsystems. Examples are:
> >>   - tag allocation mechanism (which I'm not clear why its needed)
> > As you correctly noticed, our client manages the buffers allocated and
> > registered by the server on connection establishment. Our tags are
> > just a mechanism to take and release those buffers for incoming
> > requests on the client side. Since the buffers allocated by the server are
> > to be shared between all the devices mapped from that server and all
> > their HW queues (each having num_cpus of them), the mechanism behind
> > get_tag/put_tag also takes care of fairness.
>
> We have infrastructure for this: sbitmaps.

AFAIR Roman did try to use sbitmap but found no benefits in terms of
readability or number of lines:

"What is left unchanged on IBTRS side but was suggested to modify:
 - Bart suggested to use sbitmap instead of calling find_first_zero_bit()
   and friends. I found calling pure bit API is more explicit in comparison
   to sbitmap - there is no need in using sbitmap_queue and all the power of
   wait queues, no benefits in terms of LoC as well."
https://lwn.net/Articles/756994/

If sbitmap is a must for our use case from the infrastructure point of view,
we will revisit it.

> >> - rdma rw abstraction similar to what we have in the core
> > On the one hand we have only a single IO-related function:
> > ibtrs_clt_request(READ/WRITE, session, ...), which executes an rdma write
> > with imm, or requests an rdma write with imm to be executed by the
> > server.
>
> For sure you can enhance the rw API to have imm support?

I'm not familiar with the architectural intention behind rw.c. Extending the
API with support for the imm field is (I guess) doable.

> > On the other hand we provide an abstraction to establish and
> > manage what we call a "session", which consists of multiple paths (to do
> > failover and multipath with different policies), where each path
> > consists of num_cpu rdma connections.
>
> That's fine, but it doesn't mean that it also needs to re-write
> infrastructure that we already have.

Do you refer to rw.c?

> > Once you have established a session
> > you can add or remove paths from it on the fly. In case the connection
> > to the server is lost, the client periodically attempts to reconnect
> > automatically. On the server side you get just sg-lists with a
> > direction READ or WRITE as requested by the client. We designed this
> > interface not only as the minimum required to build a block device on
> > top of rdma but also with a distributed raid in mind.
>
> I suggest you take a look at the rw API and use that in your transport.

We will look into rw.c.
Do you suggest we move the multipath and the multiple QPs per path and
connection establishment on *top* of it or *into* it?

> >> Another question, from what I understand from the code, the client
> >> always rdma_writes data on writes (with imm) from a remote pool of
> >> server buffers dedicated to it. Essentially all writes are immediate (no
> >> rdma reads ever). How is that different than using send wrs to a set of
> >> pre-posted recv buffers (like all others are doing)? Is it faster?
> > At the very beginning of the project we did some measurements and saw
> > that it is faster. I'm not sure if this is still true
>
> It's not significantly faster (can't imagine why it would be).
> What could make a difference is probably the fact that you never
> do rdma reads for I/O writes which might be better. Also perhaps the
> fact that you normally don't wait for send completions before completing
> I/O (which is broken), and the fact that you batch recv operations.
>
> I would be interested to understand what indeed makes ibnbd run faster
> though.

Yes, we would like to understand this too. I will try increasing the
inline_data_size on nvme in our benchmarks as the next step to check whether
this influences the results.

> >> Also, given that the server pre-allocates a substantial amount of memory
> >> for each connection, are the requirements on the server side documented?
> >> Usually kernel implementations (especially upstream ones) will
> >> avoid imposing such large longstanding memory requirements on the system
> >> by default. I don't have a firm stand on this, but wanted to highlight
> >> this as you are sending this for upstream inclusion.
> > We definitely need to stress that somewhere. We will include it in the
> > README and add it to the cover letter next time. Our memory management is
> > indeed basically absent in favor of performance: the server reserves
> > queue_depth buffers of, say, 512K each. Each buffer is used by the client
> > for a single IO only, no matter how big the request is. So if the client
> > only issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> > for the lowest possible latency from the beginning. It is probably possible
> > to implement some clever allocator on the server side which wouldn't affect
> > the performance a lot.
>
> Or you can fall back to rdma_read like the rest of the ulps.

We currently have a single round trip for every write IO: write + ack.
Wouldn't switching to rdma_read make 2 round trips out of it:
command + rdma_read + ack?
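For reference, the 32-bit immediate field described in the excerpt above
could be packed along the following lines. This layout is hypothetical,
made up only to illustrate "id + errno (+ need_inval)" in a single u32; it
is not the actual ibtrs wire encoding:

#include <linux/types.h>

/* hypothetical layout: 16 bits msg_id, 15 bits errno magnitude, 1 bit w_inval */
#define SKETCH_IMM_ID_BITS	16
#define SKETCH_IMM_ID_MASK	((1U << SKETCH_IMM_ID_BITS) - 1)
#define SKETCH_IMM_ERR_MASK	0x7fffU

static inline u32 sketch_io_rsp_to_imm(u32 msg_id, int errno, bool w_inval)
{
	return (msg_id & SKETCH_IMM_ID_MASK) |
	       (((u32)(-errno) & SKETCH_IMM_ERR_MASK) << SKETCH_IMM_ID_BITS) |
	       ((u32)w_inval << 31);
}

static inline void sketch_imm_to_io_rsp(u32 imm, u32 *msg_id, int *errno,
					bool *w_inval)
{
	*msg_id  = imm & SKETCH_IMM_ID_MASK;
	*errno   = -(int)((imm >> SKETCH_IMM_ID_BITS) & SKETCH_IMM_ERR_MASK);
	*w_inval = imm >> 31;
}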
> Hi Sagi,
>
>>>> Another question, from what I understand from the code, the client
>>>> always rdma_writes data on writes (with imm) from a remote pool of
>>>> server buffers dedicated to it. Essentially all writes are immediate (no
>>>> rdma reads ever). How is that different than using send wrs to a set of
>>>> pre-posted recv buffers (like all others are doing)? Is it faster?
>>> At the very beginning of the project we did some measurements and saw
>>> that it is faster. I'm not sure if this is still true
>>
>> It's not significantly faster (can't imagine why it would be).
>> What could make a difference is probably the fact that you never
>> do rdma reads for I/O writes which might be better. Also perhaps the
>> fact that you normally don't wait for send completions before completing
>> I/O (which is broken), and the fact that you batch recv operations.
>
> I don't know how you came to the conclusion that we don't wait for send
> completion before completing IO.
>
> We do chain WRs on a successful read request from the server, see function
> rdma_write_sg,

I was referring to the client side.
On Fri, Jul 12, 2019 at 9:40 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > Hi Sagi,
> >
> >>>> Another question, from what I understand from the code, the client
> >>>> always rdma_writes data on writes (with imm) from a remote pool of
> >>>> server buffers dedicated to it. Essentially all writes are immediate (no
> >>>> rdma reads ever). How is that different than using send wrs to a set of
> >>>> pre-posted recv buffers (like all others are doing)? Is it faster?
> >>> At the very beginning of the project we did some measurements and saw
> >>> that it is faster. I'm not sure if this is still true
> >>
> >> It's not significantly faster (can't imagine why it would be).
> >> What could make a difference is probably the fact that you never
> >> do rdma reads for I/O writes which might be better. Also perhaps the
> >> fact that you normally don't wait for send completions before completing
> >> I/O (which is broken), and the fact that you batch recv operations.
> >
> > I don't know how you came to the conclusion that we don't wait for send
> > completion before completing IO.
> >
> > We do chain WRs on a successful read request from the server, see function
> > rdma_write_sg,
>
> I was referring to the client side.

Hi Sagi,

I checked the 3 commits you mentioned in the earlier thread again, and I now
get your point. You meant the behavior the following commits try to fix:

4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
request has completed")
b4b591c87f2b ("nvme-rdma: don't suppress send completions")

In this sense, the ibtrs client side does not wait for the completions of
the RDMA WRITE WRs. But we did it right for local invalidation. I checked
SRP/iser: they do not even wait for local invalidation, no signal flag is
set. If it's a problem, we should fix them too, and maybe more.

My question is: do you see the behavior (HCA retrying the send due to a
dropped ack) in the field? Is it possible to reproduce?

Thanks,
Jack
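For context, this is roughly what "not suppressing send completions" boils
down to on the client side: post the RDMA WRITE+IMM signalled and complete
the request only after both the send completion and the server's response
have been seen. The struct and field names below are invented for this
sketch (only the ib_* verbs calls are real), so it illustrates the intended
behavior rather than being a patch against ibtrs:

#include <rdma/ib_verbs.h>
#include <linux/refcount.h>

/* invented for this sketch; the real ibtrs request struct differs */
struct io_req_sketch {
	struct ib_cqe	cqe;
	refcount_t	ref;	/* 2: send completion + server response */
	void		(*conf)(void *priv, int errno);
	void		*priv;
	int		errno;
};

static void io_req_sketch_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct io_req_sketch *req = container_of(wc->wr_cqe,
						 struct io_req_sketch, cqe);

	/* complete only once the response handler has dropped its reference too */
	if (refcount_dec_and_test(&req->ref))
		req->conf(req->priv, req->errno);
}

static int post_write_imm_signalled(struct ib_qp *qp, struct io_req_sketch *req,
				    struct ib_sge *sge, int num_sge,
				    u64 remote_addr, u32 rkey, __be32 imm)
{
	struct ib_rdma_wr wr = {
		.wr = {
			.wr_cqe		= &req->cqe,
			.sg_list	= sge,
			.num_sge	= num_sge,
			.opcode		= IB_WR_RDMA_WRITE_WITH_IMM,
			.send_flags	= IB_SEND_SIGNALED, /* don't suppress */
			.ex.imm_data	= imm,
		},
		.remote_addr	= remote_addr,
		.rkey		= rkey,
	};

	req->cqe.done = io_req_sketch_send_done;
	/* one ref for the send completion, one for the server's response */
	refcount_set(&req->ref, 2);

	return ib_post_send(qp, &wr.wr, NULL);
}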
Hi Sagi, thanks a lot for the information.

We are doing the right thing regarding the invalidation (your 2f122e4f5107),
but we do use unsignalled sends and need to fix that.

Please correct me if I'm wrong: the patches (4af7f7ff92a4, b4b591c87f2b) fix
the problem that if the ack from the target is lost for some reason, the
initiator's HCA will resend it even after the request is completed. But
doesn't the same problem persist the other way around as well, for acks lost
on the way from the client? I mean, the target did a send for the "read"
IOs; the client completed the request (after invalidation, refcount dropped
to 0, etc.), but the ack is not delivered to the HCA of the target, so the
target will also resend it. This seems unfixable, since the client can't
possibly know whether the server received its ack or not. Doesn't the
problem go away if rdma_conn_param.retry_count is just set to 0?

Thanks for your help,
Best, Danil.

On Tue, Jul 9, 2019 at 11:27 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> >> Thanks Jason for feedback.
> >> Can you be more specific about "the invalidation model for MR was wrong"
> >
> > MR's must be invalidated before data is handed over to the block
> > layer. It can't leave MRs open for access and then touch the memory
> > the MR covers.
>
> Jason is referring to these fixes:
> 2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing
> a request")
> 4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
> request has completed")
> b4b591c87f2b ("nvme-rdma: don't suppress send completions")