Message ID: 614481c7-22dd-d93b-e97e-52f868727ec3@grimberg.me (mailing list archive)
> On Jun 20, 2017, at 8:02 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>> mkey.free == 1
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
>
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
>
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>
>         if (unsol_sz < edtl) {
>                 hdr->flags |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               }
>
>                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                          "VA:%#llX + unsol:%d\n",
> --
>
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
>
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
>
> What do other folks think?
>
> CC'ing Bart, Chuck, Christoph.
>
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).

So on occasion there is a Remote Access Error. That would trigger
connection loss, and the retransmitted Send request is discarded (if
there was externally exposed memory involved with the original
transaction that is now invalid).

NFS has a duplicate replay cache. If it sees a repeated RPC XID it
will send a cached reply. I guess the trick there is to squelch remote
invalidation for such retransmits to avoid spurious Remote Access
Errors. Should be rare, though.
RPC-over-RDMA uses persistent registration for its inline buffers.
The problem there is avoiding buffer reuse too soon. Otherwise a
garbled inline message is presented on retransmit. Those would
probably not be caught by the DRC.

But the real problem is preventing retransmitted Sends from causing
a ULP request to be executed multiple times.

> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Typically, Sends will complete before the response arrives. The
additional cost will be handling extra interrupts, IMO.

With FRWR, won't subsequent WRs be delayed until the HCA is done with
the Send? I don't think a signal is necessary in every case. Send
Queue accounting currently relies on that.

RPC-over-RDMA relies on the completion of Local Invalidation to
ensure that the initial Send WR is complete. For Remote Invalidation
and pure inline, there is nothing to fence that Send.

The question I have is: how often do these Send retransmissions
occur? Is it enough to have a robust recovery mechanism, or do we
have to wire assumptions about retransmission into every Send
operation?

--
Chuck Lever
On Tue, Jun 20, 2017 at 6:02 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>> mkey.free == 1
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
>
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
>
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>
>         if (unsol_sz < edtl) {
>                 hdr->flags |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               }
>
>                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                          "VA:%#llX + unsol:%d\n",
> --
>
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
>
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
>
> What do other folks think?
>
> CC'ing Bart, Chuck, Christoph.
>
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).
>
> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Testing this patch I didn't see these new messages even when rebooting
the targets multiple times. It also resolved some performance problems
I was seeing (I think our switches are having bugs with IPv6 and
routing) and I was receiving expected performance.
At one point in the test, one target (4.9.33) showed:

[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
[Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure: transport retry counter exceeded (12) vend_err 81

After this and a reboot of the target, the initiator would drop the
connection after 1.5-2 minutes, then faster and faster until it was
every 5 seconds. It is almost like it set up the connection then lost
the first ping, or the ping wasn't set up right. I tried rebooting the
target multiple times. I tried to log out of the "bad" session and got
a back trace.

[Tue Jun 20 10:30:08 2017] ------------[ cut here ]------------
[Tue Jun 20 10:30:08 2017] WARNING: CPU: 20 PID: 783 at fs/sysfs/group.c:237 sysfs_remove_group+0x82/0x90
[Tue Jun 20 10:30:08 2017] Modules linked in: ib_iser rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib mlx4_ib ib_core 8021q garp mrp sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support mei_me irqbypass crct10dif_pclmul mei crc32_pclmul ioatdma ghash_clmulni_intel lpc_ich i2c_i801 pcbc mfd_core aesni_intel crypto_simd glue_helper cryptd joydev sg pcspkr shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 acpi_pad nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c raid1 dm_service_time sd_mod mlx4_en be2iscsi bnx2i cnic uio qla4xxx iscsi_boot_sysfs ast drm_kms_helper mlx5_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm
[Tue Jun 20 10:30:08 2017] drm mlx4_core igb ahci libahci ptp libata pps_core dca i2c_algo_bit dm_multipath dm_mirror dm_region_hash dm_log dm_mod dax
[Tue Jun 20 10:30:08 2017] CPU: 20 PID: 783 Comm: kworker/u64:2 Not tainted 4.12.0-rc6+ #5
[Tue Jun 20 10:30:08 2017] Hardware name: Supermicro SYS-6028TP-HTFR/X10DRT-PIBF, BIOS 1.1 08/03/2015
[Tue Jun 20 10:30:08 2017] Workqueue: scsi_wq_12 __iscsi_unbind_session
[Tue Jun 20 10:30:08 2017] task: ffff887f5c45cb00 task.stack: ffffc90032ef4000
[Tue Jun 20 10:30:08 2017] RIP: 0010:sysfs_remove_group+0x82/0x90
[Tue Jun 20 10:30:08 2017] RSP: 0018:ffffc90032ef7d18 EFLAGS: 00010246
[Tue Jun 20 10:30:08 2017] RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000000000
[Tue Jun 20 10:30:08 2017] RDX: 0000000000000000 RSI: ffff883f7fd0e068 RDI: ffff883f7fd0e068
[Tue Jun 20 10:30:08 2017] RBP: ffffc90032ef7d30 R08: 0000000000000000 R09: 0000000000000676
[Tue Jun 20 10:30:08 2017] R10: 00000000000003ff R11: 0000000000000001 R12: ffffffff81da8a40
[Tue Jun 20 10:30:08 2017] R13: ffff887f52ec0838 R14: ffff887f52ec08d8 R15: ffff883f4c611000
[Tue Jun 20 10:30:08 2017] FS:  0000000000000000(0000) GS:ffff883f7fd00000(0000) knlGS:0000000000000000
[Tue Jun 20 10:30:08 2017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jun 20 10:30:08 2017] CR2: 000055eef886b398 CR3: 0000000001c09000 CR4: 00000000001406e0
[Tue Jun 20 10:30:08 2017] Call Trace:
[Tue Jun 20 10:30:08 2017]  dpm_sysfs_remove+0x57/0x60
[Tue Jun 20 10:30:08 2017]  device_del+0x107/0x330
[Tue Jun 20 10:30:08 2017]  scsi_target_reap_ref_release+0x2d/0x40
[Tue Jun 20 10:30:08 2017]  scsi_target_reap+0x2e/0x40
[Tue Jun 20 10:30:08 2017]  scsi_remove_target+0x197/0x1b0
[Tue Jun 20 10:30:08 2017]  __iscsi_unbind_session+0xbe/0x170
[Tue Jun 20 10:30:08 2017]  process_one_work+0x149/0x360
[Tue Jun 20 10:30:08 2017]  worker_thread+0x4d/0x3c0
[Tue Jun 20 10:30:08 2017]  kthread+0x109/0x140
[Tue Jun 20 10:30:08 2017]  ? rescuer_thread+0x380/0x380
[Tue Jun 20 10:30:08 2017]  ? kthread_park+0x60/0x60
[Tue Jun 20 10:30:08 2017]  ret_from_fork+0x25/0x30
[Tue Jun 20 10:30:08 2017] Code: d5 c0 ff ff 5b 41 5c 41 5d 5d c3 48 89 df e8 66 bd ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 d0 3c a7 81 31 c0 e8 3c 01 ee ff <0f> ff eb d5 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[Tue Jun 20 10:30:08 2017] ---[ end trace 6161f21139b6a1ea ]---

Logging back into that target didn't help stabilize the connection. I
rebooted both initiator and targets to clear things up and after the
initiator went down, the target showed the timeout message again. It
seems something got out of whack and never recovered and "poisoned"
the other node in the process.

I'll test Max's patch now and report back.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> So on occasion there is a Remote Access Error. That would
> trigger connection loss, and the retransmitted Send request
> is discarded (if there was externally exposed memory involved
> with the original transaction that is now invalid).

I'm actually not concerned about the remote invalidation; it is good
that it's discarded or failed. It's the inline sends that are a bug
here.

> But the real problem is preventing retransmitted Sends from
> causing a ULP request to be executed multiple times.

Exactly.

>> Signalling all send completions and also finishing I/Os only after we
>> got them will add latency, and that sucks...
>
> Typically, Sends will complete before the response arrives.
> The additional cost will be handling extra interrupts, IMO.

Not quite, heavy traffic _can_ result in dropped acks; my gut feeling
is that it can happen more than we suspect. And yes, extra interrupts,
extra cachelines, extra state, but I do not see any other way around it.

> With FRWR, won't subsequent WRs be delayed until the HCA is
> done with the Send? I don't think a signal is necessary in
> every case. Send Queue accounting currently relies on that.

Not really, the Send after the FRWR might have a fence (not a strong
ordered one), and CX3/CX4 strong order FRWR so for them that is a
non-issue. The problem is that ULPs can't rely on it.

> RPC-over-RDMA relies on the completion of Local Invalidation
> to ensure that the initial Send WR is complete.

Wait, is that guaranteed?

> For Remote
> Invalidation and pure inline, there is nothing to fence that
> Send.
>
> The question I have is: how often do these Send retransmissions
> occur? Is it enough to have a robust recovery mechanism, or
> do we have to wire assumptions about retransmission into
> every Send operation?

Even if it's rare, we don't have any way to protect against devices
retrying the send operation.
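A minimal sketch, in kernel verbs terms, of the scheme being argued for here (all identifiers are hypothetical, not taken from iSER, NVMe-oF, SRP or xprtrdma): every request Send is signaled, and the I/O is completed, and its buffers unmapped or invalidated, only after both the reply and the Send completion have been observed.

```c
#include <rdma/ib_verbs.h>
#include <linux/atomic.h>

struct ulp_request {
	struct ib_cqe	send_cqe;	/* .done = ulp_send_done */
	atomic_t	pending;	/* 2 = Send completion + reply outstanding */
	void		(*complete)(struct ulp_request *req);
};

/* Drop one reference; unmap/invalidate only when both events are in. */
static void ulp_request_put(struct ulp_request *req)
{
	if (atomic_dec_and_test(&req->pending))
		req->complete(req);
}

static void ulp_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct ulp_request *req =
		container_of(wc->wr_cqe, struct ulp_request, send_cqe);

	ulp_request_put(req);
}

static int ulp_post_request(struct ib_qp *qp, struct ulp_request *req,
			    struct ib_sge *sgl, int num_sge)
{
	struct ib_send_wr wr = {}, *bad_wr;

	atomic_set(&req->pending, 2);
	req->send_cqe.done = ulp_send_done;

	wr.wr_cqe = &req->send_cqe;
	wr.sg_list = sgl;
	wr.num_sge = num_sge;
	wr.opcode = IB_WR_SEND;
	wr.send_flags = IB_SEND_SIGNALED;	/* every request Send is signaled */

	return ib_post_send(qp, &wr, &bad_wr);
}

/* The reply handler calls ulp_request_put(req) as its last step, so the
 * request is torn down only after both events have been seen. */
```

This is exactly the extra-interrupt, extra-cacheline cost mentioned above; the later part of the thread discusses whether it can be mitigated.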
> Testing this patch I didn't see these new messages even when rebooting
> the targets multiple times. It also resolved some performance problems
> I was seeing (I think our switches are having bugs with IPv6 and
> routing) and I was receiving expected performance. At one point in the
> test, one target (4.9.33) showed:
> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
> transport retry counter exceeded (12) vend_err 81

I don't understand, is this new with the patch applied?

> After this and a reboot of the target, the initiator would drop the
> connection after 1.5-2 minutes then faster and faster until it was
> every 5 seconds. It is almost like it set up the connection then lose
> the first ping, or the ping wasn't set-up right. I tried rebooting the
> target multiple times.

So the initiator could not recover even after the target was available
again?
On Tue, Jun 20, 2017 at 11:19 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Testing this patch I didn't see these new messages even when rebooting
>> the targets multiple times. It also resolved some performance problems
>> I was seeing (I think our switches are having bugs with IPv6 and
>> routing) and I was receiving expected performance. At one point in the
>> test, one target (4.9.33) showed:
>> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
>> transport retry counter exceeded (12) vend_err 81
>
> I don't understand, is this new with the patch applied?

I applied your patch to 4.12-rc6 on the initiator, but my targets are
still 4.9.33 since it looked like the patch only affected the
initiator. I did not see this before your patch, but I also didn't try
rebooting the targets multiple times before because of the previous
messages.

>> After this and a reboot of the target, the initiator would drop the
>> connection after 1.5-2 minutes then faster and faster until it was
>> every 5 seconds. It is almost like it set up the connection then lose
>> the first ping, or the ping wasn't set-up right. I tried rebooting the
>> target multiple times.
>
> So the initiator could not recover even after the target was available
> again?

The initiator recovered the connection when the target came back, but
the connection was not stable. I/O would happen on the connection,
then it would get shaky and then finally disconnect. Then it would
reconnect, pass more I/O, then get shaky and go down again. With the 5
second disconnects, it would pass traffic for 5 seconds, then as soon
as I saw the ping timeout, the I/O would stop until it reconnected. At
that point it seems that the lack of pings would kill the I/O, unlike
earlier where there was a stall in I/O and then the connection would
be torn down. I can try to see if I can get it to happen again.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Tue, Jun 20, 2017 at 01:01:39PM -0400, Chuck Lever wrote:

>>> Shouldn't this be protected somehow by the device?
>>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>> Say host register MR (a) and send (1) from that MR to a target,
>>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>>> so it retries, but ehh, its already invalidated.

I'm not sure I understand the example.. but...

If you pass a MR key to a send, then that MR must remain valid until
the send completion is implied by an observation on the CQ. The HCA is
free to re-execute the SEND against the MR at any time up until the
completion reaches the CQ.

As I've explained before, a ULP must not use 'implied completion', eg
a receive that could only have happened if the far side got the send.
In particular this means it cannot use an incoming SEND_INV/etc to
invalidate an MR associated with a local SEND, as that is a form of
'implied completion'.

For sanity a MR associated with a local send should not be remote
accessible at all, and shouldn't even have an 'rkey', just an 'lkey'.

Similarly, you cannot use a MR with SEND and remote access sanely, as
the far end could corrupt or invalidate the MR while the local HCA is
still using it.

> So on occasion there is a Remote Access Error. That would
> trigger connection loss, and the retransmitted Send request
> is discarded (if there was externally exposed memory involved
> with the original transaction that is now invalid).

Once you get a connection loss I would think the state of all the MRs
needs to be resync'd. Running through the CQ should indicate which
ones are invalidated and which ones are still good.

> NFS has a duplicate replay cache. If it sees a repeated RPC
> XID it will send a cached reply. I guess the trick there is
> to squelch remote invalidation for such retransmits to avoid
> spurious Remote Access Errors. Should be rare, though.

.. and because of the above if a RPC is re-issued it must be re-issued
with corrected, now-valid rkeys, and the sender must somehow detect
that the far side dropped it for replay and tear down the MRs.

> RPC-over-RDMA uses persistent registration for its inline
> buffers. The problem there is avoiding buffer reuse too soon.
> Otherwise a garbled inline message is presented on retransmit.
> Those would probably not be caught by the DRC.

We've had this discussion on the list before. You can *never* re-use a
SEND, or RDMA WRITE, buffer until you observe the HCA is done with it
via a CQ poll.

> But the real problem is preventing retransmitted Sends from
> causing a ULP request to be executed multiple times.

IB RC guarantees single delivery for SEND, so that doesn't seem
possible unless the ULP re-transmits the SEND on a new QP.

>> Signalling all send completions and also finishing I/Os only after
>> we got them will add latency, and that sucks...

There is no choice, you *MUST* see the send completion before
reclaiming any resources associated with the send. Only the completion
guarantees that the HCA will not resend the packet or otherwise
continue to use the resources.

> With FRWR, won't subsequent WRs be delayed until the HCA is
> done with the Send? I don't think a signal is necessary in
> every case. Send Queue accounting currently relies on that.

No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
send packets on the wire up to some internal limit.

Only the local state changed by FRWR related op codes happens
sequentially with other SQ work.

Jason
> On Jun 20, 2017, at 1:35 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
> On Tue, Jun 20, 2017 at 01:01:39PM -0400, Chuck Lever wrote:
>
>>>> Shouldn't this be protected somehow by the device?
>>>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>>> Say host register MR (a) and send (1) from that MR to a target,
>>>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>>>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>>>> so it retries, but ehh, its already invalidated.
>
> I'm not sure I understand the example.. but...
>
> If you pass a MR key to a send, then that MR must remain valid until
> the send completion is implied by an observation on the CQ. The HCA is
> free to re-execute the SEND against the MR at any time up until the
> completion reaches the CQ.
>
> As I've explained before, a ULP must not use 'implied completion', eg
> a receive that could only have happened if the far side got the send.
> In particular this means it cannot use an incoming SEND_INV/etc to
> invalidate an MR associated with a local SEND, as that is a form of
> 'implied completion'.
>
> For sanity a MR associated with a local send should not be remote
> accessible at all, and shouldn't even have an 'rkey', just an 'lkey'.
>
> Similarly, you cannot use a MR with SEND and remote access sanely, as
> the far end could corrupt or invalidate the MR while the local HCA is
> still using it.
>
>> So on occasion there is a Remote Access Error. That would
>> trigger connection loss, and the retransmitted Send request
>> is discarded (if there was externally exposed memory involved
>> with the original transaction that is now invalid).
>
> Once you get a connection loss I would think the state of all the MRs
> needs to be resync'd. Running through the CQ should indicate which
> ones are invalidated and which ones are still good.
>
>> NFS has a duplicate replay cache. If it sees a repeated RPC
>> XID it will send a cached reply. I guess the trick there is
>> to squelch remote invalidation for such retransmits to avoid
>> spurious Remote Access Errors. Should be rare, though.
>
> .. and because of the above if a RPC is re-issued it must be re-issued
> with corrected, now-valid rkeys, and the sender must somehow detect
> that the far side dropped it for replay and tear down the MRs.

Yes, if the RPC-over-RDMA ULP is involved, any externally accessible
memory will be re-registered before an RPC retransmission.

The concern is whether a retransmitted Send will be exposed to the
receiving ULP. Below you imply that it will not be, so perhaps this is
not a concern after all.

>> RPC-over-RDMA uses persistent registration for its inline
>> buffers. The problem there is avoiding buffer reuse too soon.
>> Otherwise a garbled inline message is presented on retransmit.
>> Those would probably not be caught by the DRC.
>
> We've had this discussion on the list before. You can *never* re-use a
> SEND, or RDMA WRITE, buffer until you observe the HCA is done with it
> via a CQ poll.

RPC-over-RDMA is careful to invalidate buffers that are the target of
RDMA Write before RPC completion, as we have discussed before.

Sends are assumed to be complete when a LocalInv completes.

When we had this discussion before, you explained the problem with
retransmitted Sends, but it appears that all the ULPs we have operate
without Send completion. Others whom I trust have suggested that
operating without that extra interrupt is preferred. The client has
operated this way since it was added to the kernel almost 10 years
ago.

So I took it as an "in a perfect world" kind of admonition. You are
making a stronger and more normative assertion here.

>> But the real problem is preventing retransmitted Sends from
>> causing a ULP request to be executed multiple times.
>
> IB RC guarantees single delivery for SEND, so that doesn't seem
> possible unless the ULP re-transmits the SEND on a new QP.
>
>>> Signalling all send completions and also finishing I/Os only after
>>> we got them will add latency, and that sucks...
>
> There is no choice, you *MUST* see the send completion before
> reclaiming any resources associated with the send. Only the completion
> guarantees that the HCA will not resend the packet or otherwise
> continue to use the resources.

On the NFS server side, I believe every Send is signaled. On the NFS
client side, we assume LocalInv completion is good enough.

>> With FRWR, won't subsequent WRs be delayed until the HCA is
>> done with the Send? I don't think a signal is necessary in
>> every case. Send Queue accounting currently relies on that.
>
> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> send packets on the wire up to some internal limit.

So if my ULP issues FastReg followed by Send followed by LocalInv
(signaled), I can't rely on the LocalInv completion to imply that the
Send is also complete?

> Only the local state changed by FRWR related op codes happens
> sequentially with other SQ work.

--
Chuck Lever
On Tue, Jun 20, 2017 at 02:17:39PM -0400, Chuck Lever wrote:

> The concern is whether a retransmitted Send will be exposed
> to the receiving ULP. Below you imply that it will not be, so
> perhaps this is not a concern after all.

A retransmitted SEND will never be exposed to the Receiver ULP for
Reliable Connected. That is part of the guarantee.

>> We've had this discussion on the list before. You can *never* re-use a
>> SEND, or RDMA WRITE, buffer until you observe the HCA is done with it
>> via a CQ poll.
>
> RPC-over-RDMA is careful to invalidate buffers that are the
> target of RDMA Write before RPC completion, as we have
> discussed before.
>
> Sends are assumed to be complete when a LocalInv completes.
>
> When we had this discussion before, you explained the problem
> with retransmitted Sends, but it appears that all the ULPs we
> have operate without Send completion. Others whom I trust have
> suggested that operating without that extra interrupt is
> preferred.

Operating without the interrupt is of course preferred, but that means
you have to defer the invalidate for MRs referred to by SEND until a
CQ observation as well.

> The client has operated this way since it was added
> to the kernel almost 10 years ago.

I thought the use of MRs with SEND was a new invention? If you use the
local rdma lkey with send, it is never invalidated, and this is not an
issue, which IIRC was the historical configuration for NFS.

> So I took it as an "in a perfect world" kind of admonition.
> You are making a stronger and more normative assertion here.

All ULPs must have periodic (related to SQ depth) signaled completions
or some of our supported hardware will explode.

All ULPs must flow control additions to the SQ based on CQ feedback,
or they will fail under load with SQ overflows. If this is done, then
the above happens correctly for free.

All ULPs must ensure SEND/RDMA Write resources remain stable until the
CQ indicates that work is completed. 'In a perfect world' this
includes not changing the source memory, as that would cause
retransmitted packets to be different.

All ULPs must ensure the lkey remains valid until the CQ confirms the
work is done. This is not important if the lkey is always the local
rdma lkey, which is always valid.

>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
>> send packets on the wire up to some internal limit.
>
> So if my ULP issues FastReg followed by Send followed by
> LocalInv (signaled), I can't rely on the LocalInv completion
> to imply that the Send is also complete?

Correct.

This is explicitly defined in Table 79 of the IBA.

It describes the ordering requirements: if you order Send followed by
LocalInv the ordering is 'L', which means they are not ordered unless
the WR has the Local Invalidate Fence bit set.

LIF is an optional feature, I do not know if any of our hardware
supports it, but it is defined to cause the local invalidate to wait
until all ongoing references to the MR are completed.

No idea on the relative performance of LIF vs doing it manually, but
the need for one or the other is unambiguously clear in the spec.

Why are you invalidating lkeys anyhow? That doesn't seem like
something that needs to happen synchronously.

Jason
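As an illustration of the first two rules above, a hedged sketch with hypothetical names (ULP_SQ_DEPTH, ULP_SIGNAL_BATCH and the helpers are illustrative, not existing code): signal at least every Nth Send so completions arrive regularly, and gate new posts on credits returned by the CQ handler rather than on any implied completion.

```c
#include <rdma/ib_verbs.h>
#include <linux/atomic.h>

#define ULP_SQ_DEPTH		128	/* hypothetical SQ size */
#define ULP_SIGNAL_BATCH	16	/* signal at least every Nth Send */

struct ulp_queue {
	struct ib_qp	*qp;
	atomic_t	 sq_avail;	/* initialized to ULP_SQ_DEPTH */
	unsigned int	 unsignaled;	/* posts since the last signaled WR;
					 * callers are assumed to serialize posting */
};

/* Called from the send CQ handler for every signaled completion: one
 * signaled WR stands in for the whole unsignaled batch before it. */
static void ulp_sq_credit(struct ulp_queue *q)
{
	atomic_add(ULP_SIGNAL_BATCH, &q->sq_avail);
}

static int ulp_post_send(struct ulp_queue *q, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	/* Flow control against CQ feedback only, never against an
	 * "implied" completion such as receiving the reply. */
	if (atomic_dec_if_positive(&q->sq_avail) < 0)
		return -EAGAIN;		/* caller retries when credits return */

	if (++q->unsignaled == ULP_SIGNAL_BATCH) {
		wr->send_flags |= IB_SEND_SIGNALED;
		q->unsignaled = 0;
	}
	return ib_post_send(q->qp, wr, &bad_wr);
}
```

This only keeps the SQ from overflowing; it does not by itself make it safe to unmap or invalidate a particular Send's buffers, which is the separate issue discussed in the rest of the thread.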
> On Jun 20, 2017, at 3:27 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
> On Tue, Jun 20, 2017 at 02:17:39PM -0400, Chuck Lever wrote:
>
>> The concern is whether a retransmitted Send will be exposed
>> to the receiving ULP. Below you imply that it will not be, so
>> perhaps this is not a concern after all.
>
> A retransmitted SEND will never be exposed to the Receiver ULP for
> Reliable Connected. That is part of the guarantee.
>
>>> We've had this discussion on the list before. You can *never* re-use a
>>> SEND, or RDMA WRITE, buffer until you observe the HCA is done with it
>>> via a CQ poll.
>>
>> RPC-over-RDMA is careful to invalidate buffers that are the
>> target of RDMA Write before RPC completion, as we have
>> discussed before.
>>
>> Sends are assumed to be complete when a LocalInv completes.
>>
>> When we had this discussion before, you explained the problem
>> with retransmitted Sends, but it appears that all the ULPs we
>> have operate without Send completion. Others whom I trust have
>> suggested that operating without that extra interrupt is
>> preferred.
>
> Operating without the interrupt is of course preferred, but that means
> you have to defer the invalidate for MRs referred to by SEND until a
> CQ observation as well.
>
>> The client has operated this way since it was added
>> to the kernel almost 10 years ago.
>
> I thought the use of MRs with SEND was a new invention? If you use the
> local rdma lkey with send, it is never invalidated, and this is not an
> issue, which IIRC was the historical configuration for NFS.

We may be conflating things a bit.

RPC-over-RDMA client uses persistently registered buffers, using the
lkey, for inline data. The use of MRs is reserved for NFS READ and
WRITE payloads. The inline buffers are never explicitly invalidated by
RPC-over-RDMA.

>> So I took it as an "in a perfect world" kind of admonition.
>> You are making a stronger and more normative assertion here.
>
> All ULPs must have periodic (related to SQ depth) signaled completions
> or some of our supported hardware will explode.

RPC-over-RDMA client does that.

> All ULPs must flow control additions to the SQ based on CQ feedback,
> or they will fail under load with SQ overflows. If this is done, then
> the above happens correctly for free.

RPC-over-RDMA client does that.

> All ULPs must ensure SEND/RDMA Write resources remain stable until the
> CQ indicates that work is completed. 'In a perfect world' this
> includes not changing the source memory, as that would cause
> retransmitted packets to be different.

I assume you mean the sending side (the server) for RDMA Write. I
believe rdma_rw uses the local rdma lkey by default for RDMA Write
source buffers.

> All ULPs must ensure the lkey remains valid until the CQ confirms the
> work is done. This is not important if the lkey is always the local
> rdma lkey, which is always valid.

As above, Send buffers use the local rdma lkey.

>>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
>>> send packets on the wire up to some internal limit.
>>
>> So if my ULP issues FastReg followed by Send followed by
>> LocalInv (signaled), I can't rely on the LocalInv completion
>> to imply that the Send is also complete?
>
> Correct.
>
> This is explicitly defined in Table 79 of the IBA.
>
> It describes the ordering requirements: if you order Send followed by
> LocalInv the ordering is 'L', which means they are not ordered unless
> the WR has the Local Invalidate Fence bit set.
>
> LIF is an optional feature, I do not know if any of our hardware
> supports it, but it is defined to cause the local invalidate to wait
> until all ongoing references to the MR are completed.

Now, since there was confusion about using an MR for a Send operation,
let me clarify. If the client does:

  FastReg(payload buffer)
  Send(inline buffer)
  ...
  Recv
  LocalInv(payload buffer)
  wait for LI completion

Is setting IB_SEND_FENCE on the LocalInv enough to ensure that the
Send is complete? cscope seems to suggest all our devices support
IB_SEND_FENCE. Sagi mentioned some devices do this fencing
automatically.

> No idea on the relative performance of LIF vs doing it manually, but
> the need for one or the other is unambiguously clear in the spec.

It seems to me that the guarantee that the server sees only one copy
of the Send payload is good enough. That means that by the time Recv
completion occurs on the client, even if the client HCA still thinks
it needs to retransmit the Send containing the RPC Call, the server
ULP has already seen and processed that Send payload, and the HCA on
the server won't deliver that payload a second time.

The RPC Reply is evidence that the server saw the correct RPC Call
message payload, and the client always preserves the Send's inline
buffer until the reply has been received.

If the only concern about preserving that inline buffer is
guaranteeing that retransmits contain the same content, I don't think
we have a problem. All HCA retransmits of an RPC Call, until the
matching RPC Reply is received on the client, will contain the same
content.

The issue about the HCA not being able to access the inline buffer
during a retransmit is also not an issue for RPC-over-RDMA because
these buffers are always registered with the local rdma lkey.

> Why are you invalidating lkeys anyhow? That doesn't seem like
> something that needs to happen synchronously.

--
Chuck Lever
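For reference, a hedged sketch in verbs terms of the flow listed above (hypothetical helpers; this is not quoted from xprtrdma, and the rkey-update step is only a typical convention). As the reply below explains, neither the signaled LocalInv completion nor IB_SEND_FENCE orders against a Send that used the local DMA lkey, so this chain alone does not prove the Send buffer is idle.

```c
#include <rdma/ib_verbs.h>

/*
 * Register the payload with FRWR and post the inline Send (local DMA lkey)
 * as one chain; mr->rkey is assumed to have been bumped beforehand, e.g.
 * with ib_update_fast_reg_key().
 */
static int ulp_post_fastreg_and_send(struct ib_qp *qp, struct ib_mr *mr,
				     int access, struct ib_sge *inline_sge)
{
	struct ib_reg_wr reg = {};
	struct ib_send_wr snd = {}, *bad_wr;

	reg.wr.opcode = IB_WR_REG_MR;
	reg.wr.next = &snd;
	reg.mr = mr;
	reg.key = mr->rkey;
	reg.access = access;

	snd.opcode = IB_WR_SEND;	/* unsignaled here, which is the point in question */
	snd.sg_list = inline_sge;
	snd.num_sge = 1;

	return ib_post_send(qp, &reg.wr, &bad_wr);
}

/* Posted only after the Recv completion for the reply has been observed. */
static int ulp_post_local_inv(struct ib_qp *qp, struct ib_cqe *li_cqe, u32 rkey)
{
	struct ib_send_wr inv = {};
	struct ib_send_wr *bad_wr;

	inv.opcode = IB_WR_LOCAL_INV;
	inv.ex.invalidate_rkey = rkey;
	inv.wr_cqe = li_cqe;
	inv.send_flags = IB_SEND_SIGNALED;	/* the ULP waits on this CQE */

	return ib_post_send(qp, &inv, &bad_wr);
}
```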
On Tue, Jun 20, 2017 at 04:56:39PM -0400, Chuck Lever wrote:

>> I thought the use of MRs with SEND was a new invention? If you use the
>> local rdma lkey with send, it is never invalidated, and this is not an
>> issue, which IIRC was the historical configuration for NFS.
>
> We may be conflating things a bit.
>
> RPC-over-RDMA client uses persistently registered buffers, using the
> lkey, for inline data. The use of MRs is reserved for NFS READ and
> WRITE payloads. The inline buffers are never explicitly invalidated by
> RPC-over-RDMA.

That makes much more sense, but is that the original question in this
thread? Why are we even talking about invalidate ordering then?

>> All ULPs must ensure SEND/RDMA Write resources remain stable until the
>> CQ indicates that work is completed. 'In a perfect world' this
>> includes not changing the source memory, as that would cause
>> retransmitted packets to be different.
>
> I assume you mean the sending side (the server) for RDMA Write. I
> believe rdma_rw uses the local rdma lkey by default for RDMA Write
> source buffers.

RDMA Write or SEND.

>>>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
>>>> send packets on the wire up to some internal limit.
>>>
>>> So if my ULP issues FastReg followed by Send followed by
>>> LocalInv (signaled), I can't rely on the LocalInv completion
>>> to imply that the Send is also complete?
>>
>> Correct.
>>
>> This is explicitly defined in Table 79 of the IBA.
>>
>> It describes the ordering requirements: if you order Send followed by
>> LocalInv the ordering is 'L', which means they are not ordered unless
>> the WR has the Local Invalidate Fence bit set.
>>
>> LIF is an optional feature, I do not know if any of our hardware
>> supports it, but it is defined to cause the local invalidate to wait
>> until all ongoing references to the MR are completed.
>
> Now, since there was confusion about using an MR for a Send operation,
> let me clarify. If the client does:
>
>   FastReg(payload buffer)
>   Send(inline buffer)
>   ...
>   Recv
>   LocalInv(payload buffer)
>   wait for LI completion

Not sure what you are describing? Is Recv landing memory for a SEND?

In that case it is using an lkey; lkeys are not remotely usable, so it
does not need synchronous invalidation. In all cases the LocalInv must
only be posted once a CQE for the Recv is observed.

If Recv is RDMA WRITE target memory, then it is using the rkey and it
does need synchronous invalidation. This must be done once a recv CQE
is observed, or optimized by having the other side send via one of the
_INV operations.

In no case can you pipeline a LocalInv into the SQ that would impact
RQ activity, even with any of the fences.

> Is setting IB_SEND_FENCE on the LocalInv enough to ensure
> that the Send is complete?

No.

There are two fences in the spec. IB_SEND_FENCE is the mandatory one,
and it only interacts with RDMA READ and ATOMIC entries.

Local Invalidate Fence (the optional one) also will not order the two,
because LIF is only defined to order against SQEs that use the MR.
Since the Send is using the global dma lkey it does not interact with
the LocalInv, and LIF will not order them.

>> No idea on the relative performance of LIF vs doing it manually, but
>> the need for one or the other is unambiguously clear in the spec.
>
> It seems to me that the guarantee that the server sees only one copy
> of the Send payload is good enough. That means that by the time Recv
> completion occurs on the client, even if the client HCA still thinks
> it needs to retransmit the Send containing the RPC Call, the server
> ULP has already seen and processed that Send payload, and the HCA on
> the server won't deliver that payload a second time.

Yes, that is OK reasoning.

> If the only concern about preserving that inline buffer is
> guaranteeing that retransmits contain the same content, I don't think
> we have a problem. All HCA retransmits of an RPC Call, until the
> matching RPC Reply is received on the client, will contain the same
> content.

Right.

> The issue about the HCA not being able to access the inline buffer
> during a retransmit is also not an issue for RPC-over-RDMA because
> these buffers are always registered with the local rdma lkey.

Exactly.

Jason
>> I don't understand, is this new with the patch applied?
>
> I applied your patch to 4.12-rc6 on the initiator, but my targets are
> still 4.9.33 since it looked like the patch only affected the
> initiator. I did not see this before your patch, but I also didn't try
> rebooting the targets multiple times before because of the previous
> messages.

That sounds like a separate issue. Should we move forward with the
suggested patch?

>>> After this and a reboot of the target, the initiator would drop the
>>> connection after 1.5-2 minutes then faster and faster until it was
>>> every 5 seconds. It is almost like it set up the connection then lose
>>> the first ping, or the ping wasn't set-up right. I tried rebooting the
>>> target multiple times.
>>
>> So the initiator could not recover even after the target was available
>> again?
>
> The initiator recovered the connection when the target came back, but
> the connection was not stable. I/O would happen on the connection,
> then it would get shaky and then finally disconnect. Then it would
> reconnect, pass more I/O, then get shaky and go down again. With the 5
> second disconnects, it would pass traffic for 5 seconds, then as soon
> as I saw the ping timeout, the I/O would stop until it reconnected. At
> that point it seems that the lack of pings would kill the I/O, unlike
> earlier where there was a stall in I/O and then the connection would
> be torn down. I can try to see if I can get it to happen again.

So it looks like the target is not responding to NOOP_OUTs (or to
traffic at all, for that matter).

The messages:

[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260

are indicating that something is stuck in the login thread, not sure
where though. Did you see a watchdog popping on a hang?

And the message:

[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure: transport retry counter exceeded (12) vend_err 81

is an indication that the rdma fabric is in some error state.

On which reboot attempt did all this happen? The first one?

Again, CCing target-devel.
Jason,

>> The issue about the HCA not being able to access the inline buffer
>> during a retransmit is also not an issue for RPC-over-RDMA because
>> these buffers are always registered with the local rdma lkey.
>
> Exactly.

Lost track of the thread...

Indeed you raised this issue lots of times before, and I failed to see
why it's important or why it's error prone, but now I do...

My apologies for not listening :(

We should fix _all_ initiators for it: nvme-rdma, iser, srp and
xprtrdma (and probably some more ULPs out there)...

It also means that we cannot really suppress any send completions, as
that would result in unpredictable latency (which is not acceptable).

I wish we could somehow tell the HCA that it can ignore an access
failure to a specific address when retransmitting.. but maybe it's too
much to ask...
> On Jun 27, 2017, at 3:37 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Jason,
>
>>> The issue about the HCA not being able to access the inline buffer
>>> during a retransmit is also not an issue for RPC-over-RDMA because
>>> these buffers are always registered with the local rdma lkey.
>>
>> Exactly.
>
> Lost track of the thread...
>
> Indeed you raised this issue lots of times before, and I failed to see
> why it's important or why it's error prone, but now I do...
>
> My apologies for not listening :(
>
> We should fix _all_ initiators for it: nvme-rdma, iser, srp and
> xprtrdma (and probably some more ULPs out there)...

Go back and browse the end of the thread: there's no need to change
xprtrdma, and maybe no need to change the others either.

> It also means that we cannot really suppress any send completions, as
> that would result in unpredictable latency (which is not acceptable).
>
> I wish we could somehow tell the HCA that it can ignore an access
> failure to a specific address when retransmitting.. but maybe it's too
> much to ask...

--
Chuck Lever
> Go back and browse the end of the thread: there's no need to change
> xprtrdma, and maybe no need to change the others either.

I think there is. Even with inline, xprtrdma DMA maps the immediate
buffers, and also the message head and tail, so unmapping these
buffers without waiting for the send completion would trigger an IOMMU
access error (the HCA (re)tries to access an already unmapped buffer).
On Tue, Jun 27, 2017 at 07:07:08PM +0300, Sagi Grimberg wrote:

>> Go back and browse the end of the thread: there's no need to change
>> xprtrdma, and maybe no need to change the others either.
>
> I think there is. Even with inline, xprtrdma DMA maps the immediate
> buffers, and also the message head and tail, so unmapping these
> buffers without waiting for the send completion would trigger an IOMMU
> access error (the HCA (re)tries to access an already unmapped buffer).

Yes, that is an excellent observation. When using the local rdma lkey
you still need to ensure the Linux API DMA map remains until
completion.

Send completion mitigation is still possible, if it is OK for the
backing pages to remain, but I think a more sophisticated strategy is
needed - eg maybe push some kind of NOP through the send q after a
timer, or only complete when the last available work is stuffed, or
something.

Jason
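One hedged reading of the batching idea (hypothetical names throughout; it assumes only that SQ completions are reported in order, so a signaled completion covers every WR posted before it on the same SQ): keep the per-Send DMA mappings on a list and release the whole batch when the next periodic signaled Send completes.

```c
#include <rdma/ib_verbs.h>
#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct ulp_unmap_ctx {
	struct list_head node;
	u64		 dma_addr;
	unsigned int	 len;
};

struct ulp_sendq {
	struct ib_device *dev;
	spinlock_t	  lock;
	struct list_head  pending_unmap;  /* mappings posted since the last signaled Send */
};

/*
 * Runs from the CQ handler of the periodic signaled Send. Everything
 * queued before that Send is also done, so its mappings can be
 * released in one batch; the trade-off is that buffers stay mapped
 * until the next signaled completion arrives.
 */
static void ulp_flush_unmaps(struct ulp_sendq *q)
{
	struct ulp_unmap_ctx *ctx, *next;
	unsigned long flags;
	LIST_HEAD(done);

	spin_lock_irqsave(&q->lock, flags);
	list_splice_init(&q->pending_unmap, &done);
	spin_unlock_irqrestore(&q->lock, flags);

	list_for_each_entry_safe(ctx, next, &done, node) {
		ib_dma_unmap_single(q->dev, ctx->dma_addr, ctx->len,
				    DMA_TO_DEVICE);
		list_del(&ctx->node);
		kfree(ctx);
	}
}
```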
> On Jun 27, 2017, at 12:07 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Go back and browse the end of the thread: there's no need to change
>> xprtrdma, and maybe no need to change the others either.
>
> I think there is. Even with inline, xprtrdma DMA maps the immediate
> buffers, and also the message head and tail, so unmapping these
> buffers without waiting for the send completion would trigger an IOMMU
> access error (the HCA (re)tries to access an already unmapped buffer).

Thinking out loud:

IIRC the message head and tail reside in the persistently registered
and DMA mapped buffers, with few exceptions. However, when page cache
pages are involved, xprtrdma will do a DMA unmap as you say.

So:

- we don't have a problem transmitting a garbled request, thanks to
  exactly-once receive semantics
- we don't have a problem with the timing of registration and
  invalidation on the initiator, because the PD's DMA lkey is used
- we do have a problem with DMA unmap

Using only persistently registered and DMA mapped Send buffers should
avoid the need to signal all Sends. However, when page cache pages are
involved, then the Send needs to be signaled, and the pages unmapped
only after Send completion, to be completely safe.

Or, the initiator should just use RDMA Read and Write, and stick with
small inline sizes.

--
Chuck Lever
On Tue, 2017-06-27 at 10:37 +0300, Sagi Grimberg wrote:
> Jason,
>
>>> The issue about the HCA not being able to access the inline buffer
>>> during a retransmit is also not an issue for RPC-over-RDMA because
>>> these buffers are always registered with the local rdma lkey.
>>
>> Exactly.
>
> Lost track of the thread...
>
> Indeed you raised this issue lots of times before, and I failed to see
> why it's important or why it's error prone, but now I do...
>
> My apologies for not listening :(
>
> We should fix _all_ initiators for it: nvme-rdma, iser, srp and
> xprtrdma (and probably some more ULPs out there)...
>
> It also means that we cannot really suppress any send completions, as
> that would result in unpredictable latency (which is not acceptable).
>
> I wish we could somehow tell the HCA that it can ignore an access
> failure to a specific address when retransmitting.. but maybe it's too
> much to ask...

Hello Sagi,

Can you clarify why you think that the SRP initiator needs to be
changed? The SRP initiator submits the local invalidate work request
after the RDMA write request. According to table 79 "Work Request
Operation Ordering" the order of these work requests must be
maintained by the HCA. I think if a HCA would start with invalidating
the MR before the remote HCA has acknowledged the written data that
that's a firmware bug.

The upstream SRP initiator does not use inline data.

Bart.
On Tue, Jun 27, 2017 at 06:08:57PM +0000, Bart Van Assche wrote:

> Can you clarify why you think that the SRP initiator needs to be
> changed? The SRP initiator submits the local invalidate work request
> after the RDMA write request. According to table 79 "Work Request
> Operation Ordering" the

That table has an 'L' for an invalidate that follows RDMA Write. L
means they are not ordered unless the optional Local Invalidate Fence
mode is used.

Jason
> Send completion mitigation is still possible, if it is OK for the
> backing pages to remain, but I think a more sophisticated strategy is
> needed - eg maybe push some kind of NOP through the send q after a
> timer, or only complete when the last available work is stuffed, or
> something.

I'm not smart enough to come up with something like that.
> Thinking out loud:
>
> IIRC the message head and tail reside in the persistently registered
> and DMA mapped buffers, with few exceptions. However, when page cache
> pages are involved, xprtrdma will do a DMA unmap as you say.
>
> So:
>
> - we don't have a problem transmitting a garbled request, thanks to
>   exactly-once receive semantics
> - we don't have a problem with the timing of registration and
>   invalidation on the initiator, because the PD's DMA lkey is used
> - we do have a problem with DMA unmap
>
> Using only persistently registered and DMA mapped Send buffers should
> avoid the need to signal all Sends. However, when page cache pages are
> involved, then the Send needs to be signaled, and the pages unmapped
> only after Send completion, to be completely safe.

How do you know when that happens?
> Hello Sagi,

Hi Bart,

> Can you clarify why you think that the SRP initiator needs to be
> changed? The SRP initiator submits the local invalidate work request
> after the RDMA write request. According to table 79 "Work Request
> Operation Ordering" the order of these work requests must be
> maintained by the HCA. I think if a HCA would start with invalidating
> the MR before the remote HCA has acknowledged the written data that
> that's a firmware bug.

That flow is fine, we were discussing immediate data sends. SRP only
needs fixing by waiting for all the local invalidates to complete
before unmapping the user buffers and completing the I/O.

BTW, did the efforts on standardizing remote invalidate in SRP ever
evolve into something? Would it be acceptable to add a non-standard
extension to ib_srp and ib_srpt? We can default to off and allow the
user to opt in if it knows the two sides comply...

We need to fix that in nvmf and iser too btw (luckily we have remote
invalidate so it's not a big issue).

> The upstream SRP initiator does not use inline data.

Yet :)

IIRC you have a branch with immediate-data support against scst so it
might be relevant there...
On Wed, 2017-06-28 at 10:16 +0300, Sagi Grimberg wrote:
> BTW, did the efforts on standardizing remote invalidate in SRP ever
> evolve into something? Would it be acceptable to add a non-standard
> extension to ib_srp and ib_srpt? We can default to off and allow the
> user to opt in if it knows the two sides comply...

I'd like to hear Doug's opinion about whether it's OK to add this
feature to the upstream drivers before support for remote invalidate
is standardized.

> We need to fix that in nvmf and iser too btw (luckily we have remote
> invalidate so it's not a big issue).
>
>> The upstream SRP initiator does not use inline data.
>
> Yet :)
>
> IIRC you have a branch with immediate-data support against scst so it
> might be relevant there...

Support for immediate data is being standardized. My plan is to add
support for immediate data to the upstream drivers once the T10
committee agrees about the implementation. See also
http://www.t10.org/cgi-bin/ac.pl?t=f&f=srp2r02a.pdf

Bart.
> On Jun 28, 2017, at 3:08 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Thinking out loud:
>>
>> IIRC the message head and tail reside in the persistently registered
>> and DMA mapped buffers, with few exceptions. However, when page cache
>> pages are involved, xprtrdma will do a DMA unmap as you say.
>>
>> So:
>>
>> - we don't have a problem transmitting a garbled request, thanks to
>>   exactly-once receive semantics
>> - we don't have a problem with the timing of registration and
>>   invalidation on the initiator, because the PD's DMA lkey is used
>> - we do have a problem with DMA unmap
>>
>> Using only persistently registered and DMA mapped Send buffers should
>> avoid the need to signal all Sends. However, when page cache pages are
>> involved, then the Send needs to be signaled, and the pages unmapped
>> only after Send completion, to be completely safe.
>
> How do you know when that happens?

The RPC Call send path sets up the Send SGE array. If it includes page
cache pages, it can set IB_SEND_SIGNALED.

The SGE array and the ib_cqe for the send are in the same data
structure, so the Send completion handler can find the SGE array and
figure out what needs to be unmapped.

The only problem is if a POSIX signal fires. In that case the data
structure can be released before the Send completion fires, and we get
touch-after-free in the completion handler.

I'm thinking that it just isn't going to be practical to handle
unmapping this way, and I should just revert back to using RDMA Read
instead of adding page cache pages to the Send SGE.

--
Chuck Lever
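A sketch of the layout just described, with hypothetical names (ULP_MAX_SEND_SGES and the sge[0] convention are illustrative assumptions, not xprtrdma's actual layout): the ib_cqe and the SGE array share one structure, IB_SEND_SIGNALED is set only when page cache pages were mapped, and the completion handler unmaps them. As noted above, this is only safe while the structure is guaranteed to outlive the Send.

```c
#include <rdma/ib_verbs.h>
#include <linux/dma-mapping.h>

#define ULP_MAX_SEND_SGES 16		/* hypothetical limit */

struct ulp_sendbuf {
	struct ib_cqe	  cqe;		/* .done = ulp_sendbuf_done */
	struct ib_device *device;
	int		  nr_mapped;	/* page cache SGEs mapped for this Send */
	struct ib_sge	  sge[ULP_MAX_SEND_SGES];
};

static void ulp_sendbuf_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct ulp_sendbuf *sb =
		container_of(wc->wr_cqe, struct ulp_sendbuf, cqe);
	int i;

	/* sge[0] is the persistently registered inline buffer and is never
	 * unmapped here; only the trailing page cache pages are. */
	for (i = 1; i <= sb->nr_mapped; i++)
		ib_dma_unmap_page(sb->device, sb->sge[i].addr,
				  sb->sge[i].length, DMA_TO_DEVICE);
}
```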
>> How do you know when that happens?
>
> The RPC Call send path sets up the Send SGE array. If it includes
> page cache pages, it can set IB_SEND_SIGNALED.
>
> The SGE array and the ib_cqe for the send are in the same data
> structure, so the Send completion handler can find the SGE array
> and figure out what needs to be unmapped.
>
> The only problem is if a POSIX signal fires. In that case the
> data structure can be released before the Send completion fires,
> and we get touch-after-free in the completion handler.
>
> I'm thinking that it just isn't going to be practical to handle
> unmapping this way, and I should just revert back to using RDMA
> Read instead of adding page cache pages to the Send SGE.

Or wait for the send completion before completing the I/O?
> On Jun 29, 2017, at 1:35 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>> How do you know when that happens?
>> The RPC Call send path sets up the Send SGE array. If it includes
>> page cache pages, it can set IB_SEND_SIGNALED.
>> The SGE array and the ib_cqe for the send are in the same data
>> structure, so the Send completion handler can find the SGE array
>> and figure out what needs to be unmapped.
>> The only problem is if a POSIX signal fires. In that case the
>> data structure can be released before the Send completion fires,
>> and we get touch-after-free in the completion handler.
>> I'm thinking that it just isn't going to be practical to handle
>> unmapping this way, and I should just revert back to using RDMA
>> Read instead of adding page cache pages to the Send SGE.
>
> Or wait for the send completion before completing the I/O?

In the normal case, that works.

If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately and
recovers all resources. The Send can still be running at that point, and it
can't be stopped (without transitioning the QP to error state, I guess).

The alternative is reference-counting the data structure that has the ib_cqe
and the SGE array. That adds one or more atomic_t operations per I/O that
I'd like to avoid.

--
Chuck Lever
>> Or wait for the send completion before completing the I/O?
>
> In the normal case, that works.
>
> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
> and recovers all resources. The Send can still be running at that
> point, and it can't be stopped (without transitioning the QP to
> error state, I guess).

In that case we can't complete the I/O either (or move the QP into error
state); we need to defer/sleep on send completion.

> The alternative is reference-counting the data structure that has
> the ib_cqe and the SGE array. That adds one or more atomic_t
> operations per I/O that I'd like to avoid.

Why atomics?
> On Jul 2, 2017, at 5:45 AM, Sagi Grimberg <sagi@grimberg.me> wrote: > > >>> Or wait for the send completion before completing the I/O? >> In the normal case, that works. >> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately >> and recovers all resources. The Send can still be running at that >> point, and it can't be stopped (without transitioning the QP to >> error state, I guess). > > In that case we can't complete the I/O either (or move the > QP into error state), we need to defer/sleep on send completion. Unfortunately the RPC client finite state machine mustn't sleep when a POSIX signal fires. xprtrdma has to unblock the waiting application process but clean up the resources asynchronously. The RPC completion doesn't have to wait on DMA unmapping the send buffer. What would have to wait is cleaning up the resources -- in particular, allowing the rpcrdma_req structure, where the send SGEs are kept, to be re-used. In the current design, both happen at the same time. >> The alternative is reference-counting the data structure that has >> the ib_cqe and the SGE array. That adds one or more atomic_t >> operations per I/O that I'd like to avoid. > > Why atomics? Either an atomic reference count or a spin lock is necessary because there are two different ways an RPC can exit: 1. The common way, which is through receipt of an RPC reply, handled by rpcrdma_reply_handler. 2. POSIX signal, where the RPC reply races with the wake-up of the application process (in other words, the reply can still arrive while the RPC is terminating). In both cases, the RPC client has to invalidate any registered memory, and it has to be done no more and no less than once. I deal with some of this in my for-13 patches: http://marc.info/?l=linux-nfs&m=149693711119727&w=2 The first seven patches handle the race condition and the need for exactly-once invalidation. But the issue with unmapping the Send buffers has to do with how the Send SGEs are managed. The data structure containing the SGEs goes away once the RPC is complete. So there are two "users": one is the RPC completion, and one is the Send completion. Once both are done, the data structure can be released. But RPC completion can't wait if the Send completion hasn't yet fired. I could kmalloc the SGE array instead, signal each Send, and then in the Send completion handler, unmap the SGEs and then kfree the SGE array. That's a lot of overhead. Or I could revert all the "map page cache pages" logic and just use memcpy for small NFS WRITEs, and RDMA the rest of the time. That keeps everything simple, but means large inline thresholds can't use send-in-place. I'm still open to suggestion. for-4.14 will deal with other problems, unless an obvious and easy fix arises. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
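For reference, a rough sketch of the reference-counted variant: one
reference held for the RPC, one for the Send WR, and whichever completes
last frees the structure. struct req_ctx and its helpers are hypothetical
names, not proposed xprtrdma code:
--
#include <linux/kref.h>
#include <linux/slab.h>

struct req_ctx {
	struct kref	ref;	/* one ref for the RPC, one for the Send WR */
	/* ib_cqe, Send SGE array, DMA mapping info, ... */
};

static void req_ctx_release(struct kref *ref)
{
	struct req_ctx *ctx = container_of(ref, struct req_ctx, ref);

	/* Both users are done: neither the reply handler nor the Send
	 * completion handler can touch ctx any more. */
	kfree(ctx);
}

/* Called once from RPC completion (or termination) and once from the Send
 * completion handler; the last caller frees the structure. */
static void req_ctx_put(struct req_ctx *ctx)
{
	kref_put(&ctx->ref, req_ctx_release);
}

static struct req_ctx *req_ctx_alloc(void)
{
	struct req_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

	if (ctx)
		kref_init(&ctx->ref);	/* ref == 1: the RPC's reference */
	return ctx;
}
--
A kref_get() would be taken when the Send WR is posted; the kref_put() pair
per I/O is the atomic overhead being weighed here against the kmalloc and
memcpy alternatives.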
On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote:

> I could kmalloc the SGE array instead, signal each Send, and then in
> the Send completion handler, unmap the SGEs and then kfree the SGE
> array. That's a lot of overhead.

Usually after allocating the send queue you'd pre-allocate all the tracking
memory needed for each SQE - eg enough information to do the dma unmaps/etc?

> Or I could revert all the "map page cache pages" logic and just use
> memcpy for small NFS WRITEs, and RDMA the rest of the time. That
> keeps everything simple, but means large inline thresholds can't use
> send-in-place.

Don't you have the same problem with RDMA WRITE?

Jason
> On Jul 9, 2017, at 12:47 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote: > >> I could kmalloc the SGE array instead, signal each Send, >> and then in the Send completion handler, unmap the SGEs >> and then kfree the SGE array. That's a lot of overhead. > > Usually after allocating the send queue you'd pre-allocate all the > tracking memory needed for each SQE - eg enough information to do the > dma unmaps/etc? Right. In xprtrdma, the QP resources are allocated in rpcrdma_ep_create. For every RPC-over-RDMA credit, rpcrdma_buffer_create allocates an rpcrdma_req structure, which contains an ib_cqe and an array of SGEs for the Send, and a number of other resources used to maintain registration state during an RPC-over-RDMA call. Both of these functions are invoked during transport instance set-up. The problem is the lifetime for the rpcrdma_req structure. Currently it is acquired when an RPC is started, and it is released when the RPC terminates. Inline send buffers are never unmapped until transport tear-down, but since: commit 655fec6987be05964e70c2e2efcbb253710e282f Author: Chuck Lever <chuck.lever@oracle.com> AuthorDate: Thu Sep 15 10:57:24 2016 -0400 Commit: Anna Schumaker <Anna.Schumaker@Netapp.com> CommitDate: Mon Sep 19 13:08:38 2016 -0400 xprtrdma: Use gathered Send for large inline messages Part of the Send payload can come from page cache pages for NFS WRITE and NFS SYMLINK operations. Send buffers that are page cache pages are DMA unmapped when rpcrdma_req is released. IIUC what Sagi found is that Send WRs can continue running even after an RPC completes in certain pathological cases. Therefore the Send WR can complete after the rpcrdma_req is released and page cache-related Send buffers have been unmapped. It's not an issue to make the RPC reply handler wait for Send completion. In most cases, this is not going to add any additional latency because the Send will complete long before the RPC reply arrives. By far the common case, and that's an extra completion interrupt for nothing. The problem arises if the RPC is terminated locally before the reply arrives. Suppose, for example, user hits ^C, or a timer fires. Then the rpcrdma_req can be released and re-used before the Send completes. There's no way to make RPC completion wait for Send completion. One option is to somehow split the Send-related data structures from rpcrdma_req, and manage them independently. I've already done that for MRs: MR state is now located in rpcrdma_mw. If instead I just never DMA map page cache pages, then all Send buffers are always left DMA mapped while the transport is active. There's no problem there with Send retransmits. The overhead is that I have to either copy data into the Send buffers, or force the server to use RDMA Read, which has a palpable overhead. >> Or I could revert all the "map page cache pages" logic and >> just use memcpy for small NFS WRITEs, and RDMA the rest of >> the time. That keeps everything simple, but means large >> inline thresholds can't use send-in-place. > > Don't you have the same problem with RDMA WRITE? The server side initiates RDMA Writes. The final RDMA Write in a WR chain is signaled, but a subsequent Send completion is used to determine when the server may release resources used for the Writes. We're already doing it the slow way there, and there's no ^C hazard on the server. 
--
Chuck Lever
On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:

> One option is to somehow split the Send-related data structures from
> rpcrdma_req, and manage them independently. I've already done that for
> MRs: MR state is now located in rpcrdma_mw.

Yes, this is what I was implying: track the SQE-related stuff separately in
memory allocated during SQ setup - MRs, DMA maps, etc. No need for an
atomic/lock then, right? The required memory is bounded since the inline
send depth is bounded.

Jason
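A sketch of what bounding that tracking memory at setup time could look
like; sq_depth, MAX_INLINE_PAGES and the structure names are assumptions for
illustration, and the post/reap side is sketched a little further down:
--
#include <rdma/ib_verbs.h>
#include <linux/slab.h>

#define MAX_INLINE_PAGES 16	/* bound on page cache pages per inline Send */

struct sqe_meta {
	u64		dma_addr[MAX_INLINE_PAGES];
	u32		dma_len[MAX_INLINE_PAGES];
	unsigned int	nr_maps;
};

struct sq_tracking {
	struct sqe_meta		*slots;		/* one per SQE, fixed at QP creation */
	unsigned int		sq_depth;
	u64			head;		/* posts consume slots */
	u64			tail;		/* completions reap slots */
	struct ib_device	*device;
};

static int sq_tracking_init(struct sq_tracking *t, struct ib_device *device,
			    unsigned int sq_depth)
{
	t->slots = kcalloc(sq_depth, sizeof(*t->slots), GFP_KERNEL);
	if (!t->slots)
		return -ENOMEM;
	t->device = device;
	t->sq_depth = sq_depth;
	t->head = 0;
	t->tail = 0;
	return 0;
}
--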
> On Jul 10, 2017, at 4:05 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote: > >> One option is to somehow split the Send-related data structures from >> rpcrdma_req, and manage them independently. I've already done that for >> MRs: MR state is now located in rpcrdma_mw. > > Yes, this is is what I was implying.. Track the SQE related stuff > seperately in memory allocated during SQ setup - MR, dma maps, etc. > No need for an atomic/lock then, right? The required memory is bounded > since the inline send depth is bounded. Perhaps I lack some imagination, but I don't see how I can manage these small objects without a serialized free list or circular array that would be accessed in the forward path and also in a Send completion handler. And I would still have to signal all Sends, which is extra interrupts and context switches. This seems like a lot of overhead to deal with a very uncommon case. I can reduce this overhead by signaling only Sends that need to unmap page cache pages, but still. But I also realized that Send Queue accounting can be broken by a delayed Send completion. As we previously discussed, xprtrdma does SQ accounting using RPC completion as the gate. Basically xprtrdma will send another RPC as soon as a previous one is terminated. If the Send WR is still running when the RPC terminates, I can potentially overrun the Send Queue. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
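As an illustration of gating the accounting on the work request itself
rather than on RPC completion, a small sketch; it assumes, for simplicity,
that every Send is signaled, which is exactly the overhead under discussion:
--
#include <linux/atomic.h>
#include <rdma/ib_verbs.h>

static atomic_t sq_avail;	/* initialized to the Send Queue depth */

static int post_send_accounted(struct ib_qp *qp, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;
	int rc;

	/* refuse to post if it would overrun the Send Queue */
	if (atomic_dec_if_positive(&sq_avail) < 0)
		return -ENOBUFS;

	rc = ib_post_send(qp, wr, &bad_wr);
	if (rc)
		atomic_inc(&sq_avail);
	return rc;
}

/* The slot is returned only when the Send WR itself completes, not when
 * the RPC that posted it terminates. */
static void send_slot_done(struct ib_cq *cq, struct ib_wc *wc)
{
	atomic_inc(&sq_avail);
}
--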
On Mon, Jul 10, 2017 at 04:51:20PM -0400, Chuck Lever wrote: > > > On Jul 10, 2017, at 4:05 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > > > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote: > > > >> One option is to somehow split the Send-related data structures from > >> rpcrdma_req, and manage them independently. I've already done that for > >> MRs: MR state is now located in rpcrdma_mw. > > > > Yes, this is is what I was implying.. Track the SQE related stuff > > seperately in memory allocated during SQ setup - MR, dma maps, etc. > > > No need for an atomic/lock then, right? The required memory is bounded > > since the inline send depth is bounded. > > Perhaps I lack some imagination, but I don't see how I can manage > these small objects without a serialized free list or circular > array that would be accessed in the forward path and also in a > Send completion handler. I don't get it, dma unmap can only ever happen in the send completion handler, it can never happen in the forward path. (this is the whole point of this thread) Since you are not using send completion today you can just use the wr_id to point to the pre-allocated memory containing the pages to invalidate. Completely remove dma unmap from the forward path. Usually I work things out so that the meta-data array is a ring and every SQE post consumes a meta-data entry. Then occasionally I signal completion and provide a wr_id of the latest ring index and the completion handler runs through all the accumulated meta-data and acts on it (eg unmaps/etc). This approach still allows batching completions. Since ring entries are bounded size we just preallocate the largest size at QP creation. In this case it is some multiple of the number of inline send pages * number of SQE entries. > This seems like a lot of overhead to deal with a very uncommon > case. I can reduce this overhead by signaling only Sends that > need to unmap page cache pages, but still. Yes, but it is not avoidable.. > As we previously discussed, xprtrdma does SQ accounting using RPC > completion as the gate. Basically xprtrdma will send another RPC > as soon as a previous one is terminated. If the Send WR is still > running when the RPC terminates, I can potentially overrun the > Send Queue. Makes sense. The SQ accounting must be precise. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
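Filling in that scheme roughly, on top of the sq_tracking/sqe_meta sketch
earlier in the thread (still illustrative, and glossing over forcing a
completion when the ring is nearly full):
--
/* Post path: every SQE consumes one metadata slot; only every Nth post is
 * signaled, carrying the slot index in wr_id. */
static void ring_post(struct sq_tracking *t, struct ib_send_wr *wr)
{
	/* the caller records its DMA mappings in
	 * t->slots[t->head % t->sq_depth] before posting */
	wr->wr_id = t->head;
	if ((t->head & 15) == 0)	/* batch: signal every 16th post */
		wr->send_flags |= IB_SEND_SIGNALED;
	t->head++;
}

/* Completion handler: Send WRs complete in posting order, so every slot up
 * to and including wc->wr_id can be unmapped now. */
static void ring_reap(struct sq_tracking *t, const struct ib_wc *wc)
{
	while (t->tail <= wc->wr_id) {
		struct sqe_meta *m = &t->slots[t->tail % t->sq_depth];
		unsigned int i;

		for (i = 0; i < m->nr_maps; i++)
			ib_dma_unmap_page(t->device, m->dma_addr[i],
					  m->dma_len[i], DMA_TO_DEVICE);
		m->nr_maps = 0;
		t->tail++;
	}
}
--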
On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> >> Or I could revert all the "map page cache pages" logic and
> >> just use memcpy for small NFS WRITEs, and RDMA the rest of
> >> the time. That keeps everything simple, but means large
> >> inline thresholds can't use send-in-place.
> >
> > Don't you have the same problem with RDMA WRITE?
>
> The server side initiates RDMA Writes. The final RDMA Write in a WR
> chain is signaled, but a subsequent Send completion is used to
> determine when the server may release resources used for the Writes.
> We're already doing it the slow way there, and there's no ^C hazard
> on the server.

Wait, I guess I meant the RDMA READ path.

The same constraints apply to RKeys as to inline send - you cannot DMA unmap
rkey memory until the rkey is invalidated at the HCA.

So posting an invalidate SQE and then immediately unmapping the DMA pages is
bad too..

No matter how the data is transferred, the unmapping must follow the same
HCA-synchronous model: DMA unmap must only be done from the send completion
handler (inline send or invalidate rkey), from the recv completion handler
(send with invalidate), or from QP error state teardown.

Anything that does DMA memory unmap from another thread is very, very
suspect, eg async from a ctrl-c trigger event.

Jason
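For the receive-side case mentioned above (Send With Invalidate), the work
completion itself reports which rkey the peer already invalidated; a small
hypothetical helper:
--
#include <rdma/ib_verbs.h>

static bool rkey_remotely_invalidated(const struct ib_wc *wc, u32 rkey)
{
	/* If set, the HCA invalidated this rkey while processing the
	 * receive, so no LocalInv is needed before unmapping that MR. */
	return (wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
	       wc->ex.invalidate_rkey == rkey;
}
--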
> On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote: > >>>> Or I could revert all the "map page cache pages" logic and >>>> just use memcpy for small NFS WRITEs, and RDMA the rest of >>>> the time. That keeps everything simple, but means large >>>> inline thresholds can't use send-in-place. >>> >>> Don't you have the same problem with RDMA WRITE? >> >> The server side initiates RDMA Writes. The final RDMA Write in a WR >> chain is signaled, but a subsequent Send completion is used to >> determine when the server may release resources used for the Writes. >> We're already doing it the slow way there, and there's no ^C hazard >> on the server. > > Wait, I guess I meant RDMA READ path. > > The same contraints apply to RKeys as inline send - you cannot DMA > unmap rkey memory until the rkey is invalidated at the HCA. > > So posting an invalidate SQE and then immediately unmapping the DMA > pages is bad too.. > > No matter how the data is transfered the unmapping must follow the > same HCA synchronous model.. DMA unmap must only be done from the send > completion handler (inline send or invalidate rkey), from the recv > completion handler (send with invalidate), or from QP error state teardown. > > Anything that does DMA memory unmap from another thread is very, very > suspect, eg async from a ctrl-c trigger event. 4.13 server side is converted to use the rdma_rw API for handling RDMA Read. For non-iWARP cases, it's using the local DMA key for Read sink buffers. For iWARP it should be using Read-with-invalidate (IIRC). -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 10, 2017 at 05:29:53PM -0400, Chuck Lever wrote: > > > On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > > > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote: > > > >>>> Or I could revert all the "map page cache pages" logic and > >>>> just use memcpy for small NFS WRITEs, and RDMA the rest of > >>>> the time. That keeps everything simple, but means large > >>>> inline thresholds can't use send-in-place. > >>> > >>> Don't you have the same problem with RDMA WRITE? > >> > >> The server side initiates RDMA Writes. The final RDMA Write in a WR > >> chain is signaled, but a subsequent Send completion is used to > >> determine when the server may release resources used for the Writes. > >> We're already doing it the slow way there, and there's no ^C hazard > >> on the server. > > > > Wait, I guess I meant RDMA READ path. > > > > The same contraints apply to RKeys as inline send - you cannot DMA > > unmap rkey memory until the rkey is invalidated at the HCA. > > > > So posting an invalidate SQE and then immediately unmapping the DMA > > pages is bad too.. > > > > No matter how the data is transfered the unmapping must follow the > > same HCA synchronous model.. DMA unmap must only be done from the send > > completion handler (inline send or invalidate rkey), from the recv > > completion handler (send with invalidate), or from QP error state teardown. > > > > Anything that does DMA memory unmap from another thread is very, very > > suspect, eg async from a ctrl-c trigger event. > > 4.13 server side is converted to use the rdma_rw API for > handling RDMA Read. For non-iWARP cases, it's using the > local DMA key for Read sink buffers. For iWARP it should > be using Read-with-invalidate (IIRC). The server sounds fine, how does the client work? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> On Jul 10, 2017, at 5:32 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > On Mon, Jul 10, 2017 at 05:29:53PM -0400, Chuck Lever wrote: >> >>> On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: >>> >>> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote: >>> >>>>>> Or I could revert all the "map page cache pages" logic and >>>>>> just use memcpy for small NFS WRITEs, and RDMA the rest of >>>>>> the time. That keeps everything simple, but means large >>>>>> inline thresholds can't use send-in-place. >>>>> >>>>> Don't you have the same problem with RDMA WRITE? >>>> >>>> The server side initiates RDMA Writes. The final RDMA Write in a WR >>>> chain is signaled, but a subsequent Send completion is used to >>>> determine when the server may release resources used for the Writes. >>>> We're already doing it the slow way there, and there's no ^C hazard >>>> on the server. >>> >>> Wait, I guess I meant RDMA READ path. >>> >>> The same contraints apply to RKeys as inline send - you cannot DMA >>> unmap rkey memory until the rkey is invalidated at the HCA. >>> >>> So posting an invalidate SQE and then immediately unmapping the DMA >>> pages is bad too.. >>> >>> No matter how the data is transfered the unmapping must follow the >>> same HCA synchronous model.. DMA unmap must only be done from the send >>> completion handler (inline send or invalidate rkey), from the recv >>> completion handler (send with invalidate), or from QP error state teardown. >>> >>> Anything that does DMA memory unmap from another thread is very, very >>> suspect, eg async from a ctrl-c trigger event. >> >> 4.13 server side is converted to use the rdma_rw API for >> handling RDMA Read. For non-iWARP cases, it's using the >> local DMA key for Read sink buffers. For iWARP it should >> be using Read-with-invalidate (IIRC). > > The server sounds fine, how does the client work? The client does not initiate RDMA Read or Write today. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:

>> The server sounds fine, how does the client work?
>
> The client does not initiate RDMA Read or Write today.

Right, but it provides an rkey that the server uses for READ or WRITE.

The invalidate of that rkey at the client must follow the same rules as
inline send.

Jason
> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > > On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote: > >>> The server sounds fine, how does the client work? >> >> The client does not initiate RDMA Read or Write today. > > Right, but it provides an rkey that the server uses for READ or WRITE. > > The invalidate of that rkey at the client must follow the same rules > as inline send. Ah, I see. The RPC reply handler calls frwr_op_unmap_sync to invalidate any MRs associated with the RPC. frwr_op_unmap_sync has to sort the rkeys that are remotely invalidated, and those that have not been. The first step is to ensure all the rkeys for an RPC are invalid. The rkey that was remotely invalidated is skipped here, and a chain of LocalInv WRs is posted to invalidate any remaining rkeys. The last WR in the chain is signaled. If one or more LocalInv WRs are posted, this function waits for LocalInv completion. The last step is always DMA unmapping. Note that we can't get a completion for a remotely invalidated rkey, and we have to wait for LocalInv to complete anyway. So the DMA unmapping is always handled here instead of in a completion handler. When frwr_op_unmap_sync returns to the RPC reply handler, the handler calls xprt_complete_rqst, and the RPC is terminated. This guarantees that the MRs are invalid before control is returned to the RPC consumer. In the ^C case, frwr_op_unmap_safe is invoked during RPC termination. The MRs are passed to the background recovery task, which invokes frwr_op_recover_mr. frwr_op_recover_mr destroys the fr_mr and DMA unmaps the memory. (It's also used when registration or invalidation flushes, which is why it uses a hammer). So here, we're a little fast/loose: the ordering of invalidation and unmapping is correct, but the MRs can be invalidated after the RPC completes. Since RPC termination can't wait, this is the best I can do for now. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
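A condensed sketch of the ordering described above (not the actual
frwr_op_unmap_sync, and error handling is omitted): chain the LocalInv WRs,
signal only the last one, sleep until it completes, and only then DMA unmap:
--
#include <linux/completion.h>
#include <rdma/ib_verbs.h>

struct last_inv {
	struct ib_cqe		cqe;
	struct completion	done;
};

static void last_inv_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct last_inv *li = container_of(wc->wr_cqe, struct last_inv, cqe);

	complete(&li->done);
}

/* 'first' heads a pre-built chain of IB_WR_LOCAL_INV WRs; 'last' is the
 * final WR in that chain. */
static int invalidate_then_unmap(struct ib_qp *qp, struct ib_send_wr *first,
				 struct ib_send_wr *last, struct last_inv *li)
{
	struct ib_send_wr *bad_wr;
	int rc;

	init_completion(&li->done);
	li->cqe.done = last_inv_done;
	last->wr_cqe = &li->cqe;
	last->send_flags |= IB_SEND_SIGNALED;

	rc = ib_post_send(qp, first, &bad_wr);
	if (rc)
		return rc;

	/* Earlier LocalInvs are ordered before the last one, so once it
	 * completes every rkey in the chain is invalid at the HCA and the
	 * DMA unmapping (the final step above) is safe. */
	wait_for_completion(&li->done);
	return 0;
}
--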
On 7/10/2017 11:57 PM, Chuck Lever wrote: > >> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: >> >> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote: >> >>>> The server sounds fine, how does the client work? >>> >>> The client does not initiate RDMA Read or Write today. >> >> Right, but it provides an rkey that the server uses for READ or WRITE. >> >> The invalidate of that rkey at the client must follow the same rules >> as inline send. > > Ah, I see. > > The RPC reply handler calls frwr_op_unmap_sync to invalidate > any MRs associated with the RPC. > > frwr_op_unmap_sync has to sort the rkeys that are remotely > invalidated, and those that have not been. Does the reply handler consider the possibility that the reply is being signaled before the send WRs? There are some really interesting races on shared or multiple CQs when the completion upcalls start to back up under heavy load that we've seen in Windows SMB Direct. In the end, we had to put explicit reference counts on each and every object, and added rundown references to everything before completing an operation and signaling the upper layer (SMB3, in our case). This found a surprising number of double completions, and missing completions from drivers as well. > The first step is to ensure all the rkeys for an RPC are > invalid. The rkey that was remotely invalidated is skipped > here, and a chain of LocalInv WRs is posted to invalidate > any remaining rkeys. The last WR in the chain is signaled. > > If one or more LocalInv WRs are posted, this function waits > for LocalInv completion. > > The last step is always DMA unmapping. Note that we can't > get a completion for a remotely invalidated rkey, and we > have to wait for LocalInv to complete anyway. So the DMA > unmapping is always handled here instead of in a > completion handler. > > When frwr_op_unmap_sync returns to the RPC reply handler, > the handler calls xprt_complete_rqst, and the RPC is > terminated. This guarantees that the MRs are invalid before > control is returned to the RPC consumer. > > > In the ^C case, frwr_op_unmap_safe is invoked during RPC > termination. The MRs are passed to the background recovery > task, which invokes frwr_op_recover_mr. That worries me. How do you know it's going in sequence, and that it will result in an invalidated MR? > frwr_op_recover_mr destroys the fr_mr and DMA unmaps the > memory. (It's also used when registration or invalidation > flushes, which is why it uses a hammer). > > So here, we're a little fast/loose: the ordering of > invalidation and unmapping is correct, but the MRs can be > invalidated after the RPC completes. Since RPC termination > can't wait, this is the best I can do for now. That would worry me even more. "fast/loose" isn't a good situation when storage is concerned. Shouldn't you just be closing the connection? Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Tom- > On Jul 11, 2017, at 9:23 AM, Tom Talpey <tom@talpey.com> wrote: > > On 7/10/2017 11:57 PM, Chuck Lever wrote: >>> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: >>> >>> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote: >>> >>>>> The server sounds fine, how does the client work? >>>> >>>> The client does not initiate RDMA Read or Write today. >>> >>> Right, but it provides an rkey that the server uses for READ or WRITE. >>> >>> The invalidate of that rkey at the client must follow the same rules >>> as inline send. >> Ah, I see. >> The RPC reply handler calls frwr_op_unmap_sync to invalidate >> any MRs associated with the RPC. >> frwr_op_unmap_sync has to sort the rkeys that are remotely >> invalidated, and those that have not been. > > Does the reply handler consider the possibility that the reply is > being signaled before the send WRs? There are some really interesting > races on shared or multiple CQs when the completion upcalls start > to back up under heavy load that we've seen in Windows SMB Direct. If I understand you correctly, that's exactly what we're discussing. The Send WR that transmitted the RPC Call can outlive the RPC transaction in rare cases. A partial solution is to move the Send SGE array into a data structure whose lifetime is independent of rpcrdma_req. That allows the client to guarantee that appropriate DMA unmaps are done only after the Send is complete. The other part of this issue is that delayed Send completion also needs to prevent initiation of another RPC, otherwise the Send Queue can overflow or the client might exceed the credit grant on this connection. This part worries me because it could mean that some serialization (eg. a completion and context switch) is needed for every Send operation on the client. > In the end, we had to put explicit reference counts on each and > every object, and added rundown references to everything before > completing an operation and signaling the upper layer (SMB3, in > our case). This found a surprising number of double completions, > and missing completions from drivers as well. > >> The first step is to ensure all the rkeys for an RPC are >> invalid. The rkey that was remotely invalidated is skipped >> here, and a chain of LocalInv WRs is posted to invalidate >> any remaining rkeys. The last WR in the chain is signaled. >> If one or more LocalInv WRs are posted, this function waits >> for LocalInv completion. >> The last step is always DMA unmapping. Note that we can't >> get a completion for a remotely invalidated rkey, and we >> have to wait for LocalInv to complete anyway. So the DMA >> unmapping is always handled here instead of in a >> completion handler. >> When frwr_op_unmap_sync returns to the RPC reply handler, >> the handler calls xprt_complete_rqst, and the RPC is >> terminated. This guarantees that the MRs are invalid before >> control is returned to the RPC consumer. >> In the ^C case, frwr_op_unmap_safe is invoked during RPC >> termination. The MRs are passed to the background recovery >> task, which invokes frwr_op_recover_mr. > > That worries me. How do you know it's going in sequence, > and that it will result in an invalidated MR? The MR is destroyed synchronously. Isn't that enough to guarantee that the rkey is invalid and DMA unmapping is safe to proceed? 
With FRWR, if a LocalInv flushes with "memory management error" there's no way to invalidate that MR other than: - destroy the MR, or - disconnect The latter is a mighty sledgehammer that affects all the live MRs and RPCs on that connection. That's why recover_mr destroys the MR in all cases. >> frwr_op_recover_mr destroys the fr_mr and DMA unmaps the >> memory. (It's also used when registration or invalidation >> flushes, which is why it uses a hammer). >> So here, we're a little fast/loose: the ordering of >> invalidation and unmapping is correct, but the MRs can be >> invalidated after the RPC completes. Since RPC termination >> can't wait, this is the best I can do for now. > > That would worry me even more. "fast/loose" isn't a good > situation when storage is concerned. I agree! > Shouldn't you just be closing the connection? We'd have to wait for the connection close to complete in a function that is not allowed to sleep, so it would have to be done in the background as well. This is no better than handing the MR to a background process. And if possible we'd like to avoid connection loss (see above). I don't like this situation, but it's the best I can do with the current architecture of the RPC client. I've been staring at this for a couple of years now wondering how to fix it. Suggestions happily accepted! -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
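As a sketch of that recovery hammer (illustrative names, not the actual
frwr_op_recover_mr): destroying the MR is the only way, short of a
disconnect, to guarantee the rkey is dead, after which unmapping and
re-allocating can proceed:
--
#include <linux/err.h>
#include <rdma/ib_verbs.h>

struct frwr_recov {		/* illustrative stand-in for rpcrdma_mw */
	struct ib_mr		*mr;
	struct ib_pd		*pd;
	struct ib_device	*device;
	struct scatterlist	*sg;
	int			sg_nents;
	enum dma_data_direction	dir;
	u32			max_pages;
};

static void recover_mr(struct frwr_recov *rm)
{
	/* Synchronous: once ib_dereg_mr() returns, the HCA can no longer
	 * use the rkey, so DMA unmapping below is safe. */
	ib_dereg_mr(rm->mr);
	ib_dma_unmap_sg(rm->device, rm->sg, rm->sg_nents, rm->dir);

	/* Replace the destroyed MR so the MR pool stays full. */
	rm->mr = ib_alloc_mr(rm->pd, IB_MR_TYPE_MEM_REG, rm->max_pages);
	if (IS_ERR(rm->mr))
		rm->mr = NULL;	/* a real implementation would retry later */
}
--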