Message ID | 1509295101-14081-3-git-send-email-idanb@mellanox.com (mailing list archive) |
---|---|
State | Changes Requested |
On Sun, Oct 29, 2017 at 06:38:21PM +0200, idanb@mellanox.com wrote:

> +static bool remote_invalidation = true;
> +module_param(remote_invalidation, bool, 0444);
> +MODULE_PARM_DESC(remote_invalidation,
> +	"request remote invalidation from subsystem (default: true)");

Please no module options.

If your device has a performance anomaly that makes SEND_WITH_INVALIDATE
slower than local invalidation, then we need to talk about having a new
verbs flag for this situation so all the ULPs can avoid using
SEND_WITH_INVALIDATE.

Sagi, is this really an apples-to-apples comparison?

When I was talking with Chuck about NFS there were various issues on
that side that made the local invalidate run faster, but it wasn't
actually working properly.

Does nvme wait for the local invalidate to finish before moving on?

Jason
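A rough sketch of the verbs-level gate Jason is describing, for illustration
only: IB_DEVICE_PREFER_LOCAL_INV below is a hypothetical capability bit that
does not exist in the verbs API, and the queue/device field names simply
follow the nvme-rdma driver of this period. The point is that the ULP keys
the decision off the device rather than off a module parameter.

/*
 * Hypothetical sketch: gate SEND_WITH_INVALIDATE on a per-device
 * capability bit (IB_DEVICE_PREFER_LOCAL_INV is invented for this
 * example), so a mixed set of HCAs each gets the behaviour that is
 * fastest for it.
 */
static bool nvme_rdma_device_wants_remote_inv(struct nvme_rdma_queue *queue)
{
	struct ib_device *ibdev = queue->device->dev;

	/* A device that invalidates faster locally would set this bit. */
	if (ibdev->attrs.device_cap_flags & IB_DEVICE_PREFER_LOCAL_INV)
		return false;

	return true;
}

nvme_rdma_map_sg_fr() would then OR NVME_SGL_FMT_INVALIDATE into sg->type
only when this helper returns true, with no user-visible knob involved.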
> On Oct 29, 2017, at 12:38 PM, idanb@mellanox.com wrote:
>
> From: Idan Burstein <idanb@mellanox.com>
>
> NVMe over Fabrics in its secure "register_always" mode
> registers and invalidates the user buffer upon each IO.
> The protocol enables the host to request that the subsystem
> use a SEND WITH INVALIDATE operation while returning the
> response capsule, invalidating the local key
> (remote_invalidation).
> In some HW implementations, the local network adapter may
> perform better when using local invalidation operations.
>
> The results below show that running with local invalidation
> rather than remote invalidation improves the IOPS one can
> achieve with the ConnectX-5 Ex network adapter by a factor of 1.36.
> Nevertheless, using local invalidation induces more CPU overhead
> than having the target invalidate remotely; because of this
> CPU% vs. IOPS tradeoff we propose a module parameter to define
> whether to request remote invalidation or not.
>
> The following results were taken against a single nvme over fabrics
> subsystem with a single namespace backed by null_blk:
>
> Block Size   s/g reg_wr       inline reg_wr    inline reg_wr + local inv
> ++++++++++   ++++++++++++++   ++++++++++++++   +++++++++++++++++++++++++
> 512B         1446.6K/8.57%    5224.2K/76.21%   7143.3K/79.72%
> 1KB          1390.6K/8.5%     4656.7K/71.69%   5860.6K/55.45%
> 2KB          1343.8K/8.6%     3410.3K/38.96%   4106.7K/55.82%
> 4KB          1254.8K/8.39%    2033.6K/15.86%   2165.3K/17.48%
> 8KB          1079.5K/7.7%     1143.1K/7.27%    1158.2K/7.33%
> 16KB         603.4K/3.64%     593.8K/3.4%      588.9K/3.77%
> 32KB         294.8K/2.04%     293.7K/1.98%     294.4K/2.93%
> 64KB         138.2K/1.32%     141.6K/1.26%     135.6K/1.34%

Units reported here are KIOPS and %CPU? What was the benchmark?

Was any root-cause analysis done to understand why the IOPS rate
changes without RI? Reduction in avg RTT? Fewer long-running
outliers? Lock contention in the ULP?

I am curious enough to add a similar setting to NFS/RDMA,
now that I have mlx5 devices.

> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
> Signed-off-by: Idan Burstein <idanb@mellanox.com>
> ---
> drivers/nvme/host/rdma.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 92a03ff..7f8225d 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -146,6 +146,11 @@ static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
>  MODULE_PARM_DESC(register_always,
>  	"Use memory registration even for contiguous memory regions");
>
> +static bool remote_invalidation = true;
> +module_param(remote_invalidation, bool, 0444);
> +MODULE_PARM_DESC(remote_invalidation,
> +	"request remote invalidation from subsystem (default: true)");

The use of a module parameter would be awkward in systems that have
a heterogeneous collection of HCAs.
> +
>  static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>  		struct rdma_cm_event *event);
>  static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
> @@ -1152,8 +1157,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>  	sg->addr = cpu_to_le64(req->mr->iova);
>  	put_unaligned_le24(req->mr->length, sg->length);
>  	put_unaligned_le32(req->mr->rkey, sg->key);
> -	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
> -			NVME_SGL_FMT_INVALIDATE;
> +	sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
> +	if (remote_invalidation)
> +		sg->type |= NVME_SGL_FMT_INVALIDATE;
>
>  	return 0;
>  }
> --
> 1.8.3.1

--
Chuck Lever
Idan,

> NVMe over Fabrics in its secure "register_always" mode
> registers and invalidates the user buffer upon each IO.
> The protocol enables the host to request that the subsystem
> use a SEND WITH INVALIDATE operation while returning the
> response capsule, invalidating the local key
> (remote_invalidation).
> In some HW implementations, the local network adapter may
> perform better when using local invalidation operations.

This is a device quirk, and if you want to optimize for it you need
to expose it via the verbs interface. We won't add a module parameter
for implementation-specific stuff.

Note that we currently need to rework the case where we use local
invalidates and properly wait for them to complete.
> Sagi, is this really an apples-to-apples comparison?

No

> When I was talking with Chuck about NFS there were various issues on
> that side that made the local invalidate run faster, but it wasn't
> actually working properly.
>
> Does nvme wait for the local invalidate to finish before moving on?

Not really, and we should; I'll get to it.
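For readers following the thread, a rough sketch of what "waiting for the
local invalidate" could look like on the host side. The reg_cqe and inv_done
fields and the surrounding flow are assumptions for this example, not the
eventual nvme-rdma fix; the idea is only that the LOCAL_INV work request is
posted signaled and the command is not completed until its CQE arrives.

/*
 * Sketch under assumed field names: post a signaled LOCAL_INV for the
 * MR's rkey and complete req->inv_done from its CQE.  The request path
 * would wait on inv_done (or refcount the two events) before completing
 * the block request.
 */
static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct nvme_rdma_request *req =
		container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);

	if (unlikely(wc->status != IB_WC_SUCCESS))
		pr_err("LOCAL_INV failed: %s\n", ib_wc_status_msg(wc->status));

	/* Only now is it safe to reuse the MR and finish the I/O. */
	complete(&req->inv_done);
}

static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
			      struct nvme_rdma_request *req)
{
	struct ib_send_wr *bad_wr;
	struct ib_send_wr wr = {
		.opcode		    = IB_WR_LOCAL_INV,
		.send_flags	    = IB_SEND_SIGNALED,	/* must generate a CQE */
		.ex.invalidate_rkey = req->mr->rkey,
		.wr_cqe		    = &req->reg_cqe,
	};

	req->reg_cqe.done = nvme_rdma_inv_rkey_done;

	return ib_post_send(queue->qp, &wr, &bad_wr);
}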
On 10/30/2017 10:14 AM, Sagi Grimberg wrote:
>
>> Sagi, is this really an apples-to-apples comparison?
>
> No
>
>> When I was talking with Chuck about NFS there were various issues on
>> that side that made the local invalidate run faster, but it wasn't
>> actually working properly.
>>
>> Does nvme wait for the local invalidate to finish before moving on?
>
> Not really, and we should; I'll get to it.

I agree we should fix this; nevertheless, I don't expect it to create
an IO-rate issue at long queue depths.
There will indeed be a tradeoff here between latency and rate.
On 10/30/2017 10:11 AM, Sagi Grimberg wrote:
> Idan,
>
>> NVMe over Fabrics in its secure "register_always" mode
>> registers and invalidates the user buffer upon each IO.
>> The protocol enables the host to request that the subsystem
>> use a SEND WITH INVALIDATE operation while returning the
>> response capsule, invalidating the local key
>> (remote_invalidation).
>> In some HW implementations, the local network adapter may
>> perform better when using local invalidation operations.
>
> This is a device quirk, and if you want to optimize for it you need
> to expose it via the verbs interface. We won't add a module parameter
> for implementation-specific stuff.

After our discussion yesterday I was thinking it may be better to hide
this within the nvme host driver, using a message-size threshold for
deciding when to request send_with_invalidate (a rough sketch follows
below). To make it device agnostic we will need to come up with a way
to expose such a limitation.

> Note that we currently need to rework the case where we use local
> invalidates and properly wait for them to complete.
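A minimal sketch of the size-threshold idea, under stated assumptions:
remote_inv_threshold and the helper below are hypothetical and exist only to
illustrate how the decision could be hidden inside the host driver rather
than exposed as a module parameter.

/*
 * Hypothetical knob (bytes): payloads below it skip the invalidate hint
 * in the response capsule and would instead be invalidated locally,
 * which is where the numbers in this thread show a win for small I/Os.
 */
static unsigned int remote_inv_threshold = SZ_8K;

static bool nvme_rdma_payload_wants_remote_inv(struct nvme_rdma_request *req)
{
	return req->mr->length >= remote_inv_threshold;
}

nvme_rdma_map_sg_fr() would then set NVME_SGL_FMT_INVALIDATE only when the
helper returns true, instead of consulting a module parameter.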
> I agree we should fix this; nevertheless, I don't expect it to create
> an IO-rate issue at long queue depths.
> There will indeed be a tradeoff here between latency and rate.

I suggest we hold back on this one for now; I have fixes for
the local invalidate flow. Once we get them in we can see if
this approach makes a difference, and if so, we need to expose
it through a generic mechanism.
On 10/30/2017 11:44 AM, Sagi Grimberg wrote:
>
>> I agree we should fix this; nevertheless, I don't expect it to create
>> an IO-rate issue at long queue depths.
>> There will indeed be a tradeoff here between latency and rate.
>
> I suggest we hold back on this one for now; I have fixes for
> the local invalidate flow. Once we get them in we can see if
> this approach makes a difference, and if so, we need to expose
> it through a generic mechanism.

I suggest you post the patches to the list and we'll investigate the
latency implications; if they are negligible, we have a win here for
small IOs.

When will you be able to post your patches?
>>> I agree we should fix this; nevertheless, I don't expect it to create
>>> an IO-rate issue at long queue depths.
>>> There will indeed be a tradeoff here between latency and rate.
>>
>> I suggest we hold back on this one for now; I have fixes for
>> the local invalidate flow. Once we get them in we can see if
>> this approach makes a difference, and if so, we need to expose
>> it through a generic mechanism.
>
> I suggest you post the patches to the list and we'll investigate the
> latency implications; if they are negligible, we have a win here for
> small IOs.
>
> When will you be able to post your patches?

I can post them now, but I prefer to test them first ;)
On 10/30/2017 12:33 PM, Sagi Grimberg wrote:
>
>>>> I agree we should fix this; nevertheless, I don't expect it to create
>>>> an IO-rate issue at long queue depths.
>>>> There will indeed be a tradeoff here between latency and rate.
>>>
>>> I suggest we hold back on this one for now; I have fixes for
>>> the local invalidate flow. Once we get them in we can see if
>>> this approach makes a difference, and if so, we need to expose
>>> it through a generic mechanism.
>>
>> I suggest you post the patches to the list and we'll investigate the
>> latency implications; if they are negligible, we have a win here for
>> small IOs.
>>
>> When will you be able to post your patches?
>
> I can post them now, but I prefer to test them first ;)

Sounds good, we will check once you post them. I expect no impact on
IOPS, just latency.
> On Oct 29, 2017, at 2:24 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On Oct 29, 2017, at 12:38 PM, idanb@mellanox.com wrote:
>>
>> From: Idan Burstein <idanb@mellanox.com>
>>
>> NVMe over Fabrics in its secure "register_always" mode
>> registers and invalidates the user buffer upon each IO.
>> The protocol enables the host to request that the subsystem
>> use a SEND WITH INVALIDATE operation while returning the
>> response capsule, invalidating the local key
>> (remote_invalidation).
>> In some HW implementations, the local network adapter may
>> perform better when using local invalidation operations.
>>
>> The results below show that running with local invalidation
>> rather than remote invalidation improves the IOPS one can
>> achieve with the ConnectX-5 Ex network adapter by a factor of 1.36.
>> Nevertheless, using local invalidation induces more CPU overhead
>> than having the target invalidate remotely; because of this
>> CPU% vs. IOPS tradeoff we propose a module parameter to define
>> whether to request remote invalidation or not.
>>
>> The following results were taken against a single nvme over fabrics
>> subsystem with a single namespace backed by null_blk:
>>
>> Block Size   s/g reg_wr       inline reg_wr    inline reg_wr + local inv
>> ++++++++++   ++++++++++++++   ++++++++++++++   +++++++++++++++++++++++++
>> 512B         1446.6K/8.57%    5224.2K/76.21%   7143.3K/79.72%
>> 1KB          1390.6K/8.5%     4656.7K/71.69%   5860.6K/55.45%
>> 2KB          1343.8K/8.6%     3410.3K/38.96%   4106.7K/55.82%
>> 4KB          1254.8K/8.39%    2033.6K/15.86%   2165.3K/17.48%
>> 8KB          1079.5K/7.7%     1143.1K/7.27%    1158.2K/7.33%
>> 16KB         603.4K/3.64%     593.8K/3.4%      588.9K/3.77%
>> 32KB         294.8K/2.04%     293.7K/1.98%     294.4K/2.93%
>> 64KB         138.2K/1.32%     141.6K/1.26%     135.6K/1.34%
>
> Units reported here are KIOPS and %CPU? What was the benchmark?
>
> Was any root-cause analysis done to understand why the IOPS rate
> changes without RI? Reduction in avg RTT? Fewer long-running
> outliers? Lock contention in the ULP?
>
> I am curious enough to add a similar setting to NFS/RDMA,
> now that I have mlx5 devices.

I see the plan is to change the NVMeoF initiator to wait for
invalidation to complete, and then test again. However, I am still
curious what we're dealing with for NFS/RDMA (which always waits for
invalidation before allowing an RPC to complete).

I tested here with a pair of EDR CX-4s in RoCE mode. I found that
there is a 15-20us latency penalty when Remote Invalidation is used
with RDMA Read, but when used with RDMA Write, Remote Invalidation
behaves as desired. The RDMA Read penalty vanishes once payloads are
larger than about 8kB.

Simple QD=1 iozone test with direct I/O on NFSv3, and a tmpfs share
on the NFS server. In this test, memory registration is used for all
data payloads.

Remote Invalidation enabled, reclen in kB, output in microseconds:

      kB  reclen   write  rewrite    read   reread
  131072       1      47       54      27       27
  131072       2      61       62      27       28
  131072       4      63       62      28       28
  131072       8      59       65      29       29
  131072      16      67       66      31       32
  131072      32      75       73      42       42
  131072      64      92       87      64       44

Remote Invalidation disabled, reclen in kB, output in microseconds:

      kB  reclen   write  rewrite    read   reread
  131072       1      43       43      32       32
  131072       2      45       52      32       32
  131072       4      48       45      32       32
  131072       8      68       64      33       33
  131072      16      74       60      35       35
  131072      32      85       82      49       41
  131072      64     102       98      63       52

I would expect a similar ~5us latency benefit for both RDMA Reads and
RDMA Writes when Remote Invalidation is in use.
Small I/Os involving RDMA Read are precisely where NFS/RDMA gains the
most benefit from Remote Invalidation, so this result is disappointing.

Unfortunately, the current version of RPC-over-RDMA does not have the
ability to convey an Rkey for the target to invalidate remotely, though
that is one of the features being considered for the next version.

IOPS results (fio 70/30, multi-thread iozone, etc.) are very clearly
worse without Remote Invalidation, so I currently don't plan to submit
a patch that allows RI to be disabled for some NFS/RDMA mounts.

Currently the Linux NFS client indicates to servers that RI is
supported whenever FRWR is supported on the client's HCA. The server
is then free to decide whether to use Send With Invalidate with any
Rkey in the current RPC.

--
Chuck Lever
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff..7f8225d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -146,6 +146,11 @@ static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
 MODULE_PARM_DESC(register_always,
 	"Use memory registration even for contiguous memory regions");
 
+static bool remote_invalidation = true;
+module_param(remote_invalidation, bool, 0444);
+MODULE_PARM_DESC(remote_invalidation,
+	"request remote invalidation from subsystem (default: true)");
+
 static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 		struct rdma_cm_event *event);
 static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc);
@@ -1152,8 +1157,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
 	sg->addr = cpu_to_le64(req->mr->iova);
 	put_unaligned_le24(req->mr->length, sg->length);
 	put_unaligned_le32(req->mr->rkey, sg->key);
-	sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
-			NVME_SGL_FMT_INVALIDATE;
+	sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
+	if (remote_invalidation)
+		sg->type |= NVME_SGL_FMT_INVALIDATE;
 
 	return 0;
 }