| Message ID | 20180917213554.987-6-bvanassche@acm.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Make ABORT and LUN RESET handling synchronous |
On Mon, Sep 17, 2018 at 02:35:42PM -0700, Bart Van Assche wrote:
> A session must only be released after all code that accesses the session
> structure has finished. Make sure that this is the case by introducing a
> new command counter per session that is only decremented after the
> .release_cmd() callback has finished.

Can you explain what problems we are running into right now due to the
lack of a refcount?

> @@ -2897,6 +2907,8 @@ void target_sess_cmd_list_set_waiting(struct se_session *se_sess)
>  	spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
>  	se_sess->sess_tearing_down = 1;
>  	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
> +
> +	percpu_ref_kill_and_confirm(&se_sess->cmd_count, NULL);

This is equivalent to a plain percpu_ref_kill() call.

> @@ -2911,17 +2923,14 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>
>  	WARN_ON_ONCE(!se_sess->sess_tearing_down);
>
> -	spin_lock_irq(&se_sess->sess_cmd_lock);
>  	do {
> -		ret = wait_event_interruptible_lock_irq_timeout(
> -				se_sess->cmd_list_wq,
> -				list_empty(&se_sess->sess_cmd_list),
> -				se_sess->sess_cmd_lock, 180 * HZ);
> +		ret = wait_event_timeout(se_sess->cmd_list_wq,
> +				percpu_ref_is_zero(&se_sess->cmd_count),
> +				180 * HZ);

So this is basically the big change - check for a zero reference instead
of the list_empty().

I fail to see how this makes a difference, and also why we even need a
percpu ref, as the get/put calls are under, or right next to,
sess_cmd_lock critical sections.
On Mon, 2018-09-17 at 14:35 -0700, Bart Van Assche wrote:
> A session must only be released after all code that accesses the session
> structure has finished. Make sure that this is the case by introducing a
> new command counter per session that is only decremented after the
> .release_cmd() callback has finished.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Cc: Nicholas Bellinger <nab@linux-iscsi.org>
> Cc: Mike Christie <mchristi@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/target/target_core_transport.c | 27 +++++++++++++++++---------
>  include/target/target_core_base.h      |  1 +
>  2 files changed, 19 insertions(+), 9 deletions(-)
>

Per HCH's earlier comment: is this required to fix a regression introduced
by commit 00d909a107 in mainline? Is it SRPT specific?

Btw, I'm not completely against moving to a se_sess percpu_ref for tracking
se_cmd association to a session for the purposes of active I/O shutdown.
However, inter-mixing it with the existing se_sess->sess_cmd_lock +
se_sess->sess_cmd_list for the single benefit of dropping ->sess_cmd_lock
during non fast-path code in target_wait_for_sess_cmds() doesn't really
make sense.

That said, what would be better is to utilize a new se_sess percpu_ref that
eliminates se_sess->sess_cmd_lock + se_sess->sess_cmd_list altogether!

How? Well, what about using the newly sbitmap-converted (thanks willy)
se_sess->sess_tag_pool instead? Similar to how blk_mq_queue_tag_busy_iter()
uses sbitmap_for_each_set(), the same could be done to eliminate
se_sess->sess_cmd_list altogether. However, that would depend on all
upstream fabric drivers using se_sess->sess_tag_pool. This is true in all
but one case since 4.6.x, the SRPT driver.

In addition, there would likely be some hairy TMR ABORT_TASK issues to
resolve no doubt, but killing se_sess->sess_cmd_[lock,list] and moving to
per-se_session percpu_ref counting is a good goal.
Anyways, I'll review this patch as a starting point to remove
se_sess->sess_cmd_[lock,list] altogether. Comments inline below.

> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 94e9d03af99d..54ccd8f56a57 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -224,6 +224,13 @@ void transport_subsystem_check_init(void)
>  	sub_api_initialized = 1;
>  }
>
> +static void target_release_sess_cmd_refcnt(struct percpu_ref *ref)
> +{
> +	struct se_session *sess = container_of(ref, typeof(*sess), cmd_count);
> +
> +	wake_up(&sess->cmd_list_wq);
> +}
> +
>  /**
>   * transport_init_session - initialize a session object
>   * @se_sess: Session object pointer.
> @@ -237,7 +244,8 @@ int transport_init_session(struct se_session *se_sess)
>  	INIT_LIST_HEAD(&se_sess->sess_cmd_list);
>  	spin_lock_init(&se_sess->sess_cmd_lock);
>  	init_waitqueue_head(&se_sess->cmd_list_wq);
> -	return 0;
> +	return percpu_ref_init(&se_sess->cmd_count,
> +			       target_release_sess_cmd_refcnt, 0, GFP_KERNEL);
>  }
>  EXPORT_SYMBOL(transport_init_session);
>
> @@ -587,6 +595,7 @@ void transport_free_session(struct se_session *se_sess)
>  		sbitmap_queue_free(&se_sess->sess_tag_pool);
>  		kvfree(se_sess->sess_cmd_map);
>  	}
> +	percpu_ref_exit(&se_sess->cmd_count);
>  	kmem_cache_free(se_sess_cache, se_sess);
>  }
>  EXPORT_SYMBOL(transport_free_session);
> @@ -2730,6 +2739,7 @@ int target_get_sess_cmd(struct se_cmd *se_cmd, bool ack_kref)
>  	}
>  	se_cmd->transport_state |= CMD_T_PRE_EXECUTE;
>  	list_add_tail(&se_cmd->se_cmd_list, &se_sess->sess_cmd_list);
> +	percpu_ref_get(&se_sess->cmd_count);
>  out:
>  	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
>

This would need to be a percpu_ref_tryget_live() before CMD_T_PRE_EXECUTE
is assigned, replacing the sess->sess_tearing_down check.
> @@ -2760,8 +2770,6 @@ static void target_release_cmd_kref(struct kref *kref)
>  	if (se_sess) {
>  		spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
>  		list_del_init(&se_cmd->se_cmd_list);
> -		if (list_empty(&se_sess->sess_cmd_list))
> -			wake_up(&se_sess->cmd_list_wq);
>  		spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
>  	}
>

Btw, I noticed this small regression as part of your commit 00d909a1 in
mainline, separate from this patch. That is, for an iodepth=1 workload the
new wake_up() above in target_release_cmd_kref() is getting called on every
se_cmd->cmd_kref release, even during the normal case when a session is not
actually being shut down.

To address this for <= v4.19, separate from further changes:

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 049a966..0473bf5 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2760,7 +2760,7 @@ static void target_release_cmd_kref(struct kref *kref)
 	if (se_sess) {
 		spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
 		list_del_init(&se_cmd->se_cmd_list);
-		if (list_empty(&se_sess->sess_cmd_list))
+		if (se_sess->sess_tearing_down && list_empty(&se_sess->sess_cmd_list))
 			wake_up(&se_sess->cmd_list_wq);
 		spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
 	}

I'll send it out separately for MKP to pick up.

> @@ -2769,6 +2777,8 @@ static void target_release_cmd_kref(struct kref *kref)
>  	se_cmd->se_tfo->release_cmd(se_cmd);
>  	if (compl)
>  		complete(compl);
> +
> +	percpu_ref_put(&se_sess->cmd_count);
>  }
>
>  /**
> @@ -2897,6 +2907,8 @@ void target_sess_cmd_list_set_waiting(struct se_session *se_sess)
>  	spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
>  	se_sess->sess_tearing_down = 1;
>  	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
> +
> +	percpu_ref_kill_and_confirm(&se_sess->cmd_count, NULL);
>  }
>  EXPORT_SYMBOL(target_sess_cmd_list_set_waiting);
>

Here is where the percpu-ref and RCU grace-period magic comes in.
As a (future) consequence of relying solely upon se_sess->cmd_count, the
percpu_ref_kill_and_confirm() needs a percpu_ref->confirm_switch() callback
to work correctly, along with a matching local wait_for_completion() within
target_sess_cmd_list_set_waiting().

That is, the completion waits for percpu_ref_switch_to_atomic_rcu() to call
percpu_ref->confirm_switch() after an RCU grace period has elapsed, so that
se_sess->cmd_count is seen as __PERCPU_REF_DEAD on all CPUs. From there,
se_sess->cmd_count is switched to atomic_t mode so that a
percpu_ref_tryget_live() lookup of se_sess->cmd_count fails for all new
incoming I/O.

This is exactly how se_lun->lun_ref works today. See commit bd4e2d2907f for
more information.

> @@ -2911,17 +2923,14 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
>
>  	WARN_ON_ONCE(!se_sess->sess_tearing_down);
>
> -	spin_lock_irq(&se_sess->sess_cmd_lock);
>  	do {
> -		ret = wait_event_interruptible_lock_irq_timeout(
> -				se_sess->cmd_list_wq,
> -				list_empty(&se_sess->sess_cmd_list),
> -				se_sess->sess_cmd_lock, 180 * HZ);
> +		ret = wait_event_timeout(se_sess->cmd_list_wq,
> +				percpu_ref_is_zero(&se_sess->cmd_count),
> +				180 * HZ);
>  		list_for_each_entry(cmd, &se_sess->sess_cmd_list, se_cmd_list)
>  			target_show_cmd("session shutdown: still waiting for ",
>  					cmd);
>  	} while (ret <= 0);
> -	spin_unlock_irq(&se_sess->sess_cmd_lock);
>  }
>  EXPORT_SYMBOL(target_wait_for_sess_cmds);

It would be useful to move the hardcoded '180' into a tunable at some point.
On Sat, 2018-10-06 at 14:05 +0200, Christoph Hellwig wrote:
> On Mon, Sep 17, 2018 at 02:35:42PM -0700, Bart Van Assche wrote:
> > A session must only be released after all code that accesses the session
> > structure has finished. Make sure that this is the case by introducing a
> > new command counter per session that is only decremented after the
> > .release_cmd() callback has finished.
>
> Can you explain what problems we are running into right now due to the
> lack of a refcount?

I will explain that further down in this e-mail.

> > @@ -2897,6 +2907,8 @@ void target_sess_cmd_list_set_waiting(struct se_session *se_sess)
> >  	spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
> >  	se_sess->sess_tearing_down = 1;
> >  	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
> > +
> > +	percpu_ref_kill_and_confirm(&se_sess->cmd_count, NULL);
>
> This is equivalent to a plain percpu_ref_kill() call.

OK, I will change this into percpu_ref_kill().

> > @@ -2911,17 +2923,14 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
> >
> >  	WARN_ON_ONCE(!se_sess->sess_tearing_down);
> >
> > -	spin_lock_irq(&se_sess->sess_cmd_lock);
> >  	do {
> > -		ret = wait_event_interruptible_lock_irq_timeout(
> > -				se_sess->cmd_list_wq,
> > -				list_empty(&se_sess->sess_cmd_list),
> > -				se_sess->sess_cmd_lock, 180 * HZ);
> > +		ret = wait_event_timeout(se_sess->cmd_list_wq,
> > +				percpu_ref_is_zero(&se_sess->cmd_count),
> > +				180 * HZ);
>
> So this is basically the big change - check for a zero reference instead
> of the list_empty().
>
> I fail to see how this makes a difference, and also why we even need a
> percpu ref, as the get/put calls are under, or right next to,
> sess_cmd_lock critical sections.

Hi Christoph,

After having called target_wait_for_sess_cmds(), several target drivers
call target_remove_session(). That last function frees the session object
synchronously. In other words, it is not safe to use the session pointer
after target_wait_for_sess_cmds() has returned.
That means that target_wait_for_sess_cmds() must wait until all code that
can dereference the session pointer has finished, including
target_release_cmd_kref(). That function executes the following code after
having removed a command from the command list:

	target_free_cmd_mem(se_cmd);
	se_cmd->se_tfo->release_cmd(se_cmd);
	if (compl)
		complete(compl);

Hence this patch, which makes target_wait_for_sess_cmds() wait until
target_release_cmd_kref() has called .release_cmd(). Since I applied this
patch I have not hit the following crash anymore:

BUG: KASAN: use-after-free in do_raw_spin_lock+0x1c/0x130
Read of size 4 at addr ffff8801534b16e4 by task rmdir/14805
CPU: 16 PID: 14805 Comm: rmdir Not tainted 4.18.0-rc2-dbg+ #5
Call Trace:
 dump_stack+0xa4/0xf5
 print_address_description+0x6f/0x270
 kasan_report+0x241/0x360
 __asan_load4+0x78/0x80
 do_raw_spin_lock+0x1c/0x130
 _raw_spin_lock_irqsave+0x52/0x60
 srpt_set_ch_state+0x27/0x70 [ib_srpt]
 srpt_disconnect_ch+0x1b/0xc0 [ib_srpt]
 srpt_close_session+0xa8/0x260 [ib_srpt]
 target_shutdown_sessions+0x170/0x180 [target_core_mod]
 core_tpg_del_initiator_node_acl+0xf3/0x200 [target_core_mod]
 target_fabric_nacl_base_release+0x25/0x30 [target_core_mod]
 config_item_release+0x9c/0x110 [configfs]
 config_item_put+0x26/0x30 [configfs]
 configfs_rmdir+0x3b8/0x510 [configfs]
 vfs_rmdir+0xb3/0x1e0
 do_rmdir+0x262/0x2c0
 do_syscall_64+0x77/0x230
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Please let me know if you need more information.

Bart.
On Sat, 2018-10-06 at 20:28 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2018-09-17 at 14:35 -0700, Bart Van Assche wrote:
> > diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> > index 94e9d03af99d..54ccd8f56a57 100644
> > --- a/drivers/target/target_core_transport.c
> > +++ b/drivers/target/target_core_transport.c
> > @@ -224,6 +224,13 @@ void transport_subsystem_check_init(void)
> >  	sub_api_initialized = 1;
> >  }
> >
> > +static void target_release_sess_cmd_refcnt(struct percpu_ref *ref)
> > +{
> > +	struct se_session *sess = container_of(ref, typeof(*sess), cmd_count);
> > +
> > +	wake_up(&sess->cmd_list_wq);
> > +}
> > +
> >  /**
> >   * transport_init_session - initialize a session object
> >   * @se_sess: Session object pointer.
> > @@ -237,7 +244,8 @@ int transport_init_session(struct se_session *se_sess)
> >  	INIT_LIST_HEAD(&se_sess->sess_cmd_list);
> >  	spin_lock_init(&se_sess->sess_cmd_lock);
> >  	init_waitqueue_head(&se_sess->cmd_list_wq);
> > -	return 0;
> > +	return percpu_ref_init(&se_sess->cmd_count,
> > +			       target_release_sess_cmd_refcnt, 0, GFP_KERNEL);
> >  }
> >  EXPORT_SYMBOL(transport_init_session);
> >
> > @@ -587,6 +595,7 @@ void transport_free_session(struct se_session *se_sess)
> >  		sbitmap_queue_free(&se_sess->sess_tag_pool);
> >  		kvfree(se_sess->sess_cmd_map);
> >  	}
> > +	percpu_ref_exit(&se_sess->cmd_count);
> >  	kmem_cache_free(se_sess_cache, se_sess);
> >  }
> >  EXPORT_SYMBOL(transport_free_session);
> > @@ -2730,6 +2739,7 @@ int target_get_sess_cmd(struct se_cmd *se_cmd, bool ack_kref)
> >  	}
> >  	se_cmd->transport_state |= CMD_T_PRE_EXECUTE;
> >  	list_add_tail(&se_cmd->se_cmd_list, &se_sess->sess_cmd_list);
> > +	percpu_ref_get(&se_sess->cmd_count);
> >  out:
> >  	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
> >
>
> This would need to be a percpu_ref_tryget_live() before CMD_T_PRE_EXECUTE
> is assigned, replacing the sess->sess_tearing_down check.
I think the current code is fine since the percpu_ref_get() call happens
with the sess_cmd_lock held and after the sess_tearing_down flag has been
checked. The spinlock is needed anyway to protect the list_add_tail() call.

Let's keep the discussion about dropping se_cmd_list and switching to an
sbitmap for later, because this patch series is already complicated enough
and also because I'm not convinced that switching to an sbitmap would be
an improvement.

> > @@ -2769,6 +2777,8 @@ static void target_release_cmd_kref(struct kref *kref)
> >  	se_cmd->se_tfo->release_cmd(se_cmd);
> >  	if (compl)
> >  		complete(compl);
> > +
> > +	percpu_ref_put(&se_sess->cmd_count);
> >  }
> >
> >  /**
> > @@ -2897,6 +2907,8 @@ void target_sess_cmd_list_set_waiting(struct se_session *se_sess)
> >  	spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
> >  	se_sess->sess_tearing_down = 1;
> >  	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
> > +
> > +	percpu_ref_kill_and_confirm(&se_sess->cmd_count, NULL);
> >  }
> >  EXPORT_SYMBOL(target_sess_cmd_list_set_waiting);
> >
>
> Here is where the percpu-ref and RCU grace-period magic comes in.
>
> As a (future) consequence of relying solely upon se_sess->cmd_count, the
> percpu_ref_kill_and_confirm() needs a percpu_ref->confirm_switch()
> callback to work correctly, along with a matching local
> wait_for_completion() within target_sess_cmd_list_set_waiting().

I do not agree. A session is only freed after target_wait_for_sess_cmds()
has returned, so it is sufficient if that function waits until the switch
to atomic mode of the percpu-refcount has finished. I don't see why
target_sess_cmd_list_set_waiting() should wait until the switch to atomic
mode has finished.

Bart.
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 94e9d03af99d..54ccd8f56a57 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -224,6 +224,13 @@ void transport_subsystem_check_init(void)
 	sub_api_initialized = 1;
 }
 
+static void target_release_sess_cmd_refcnt(struct percpu_ref *ref)
+{
+	struct se_session *sess = container_of(ref, typeof(*sess), cmd_count);
+
+	wake_up(&sess->cmd_list_wq);
+}
+
 /**
  * transport_init_session - initialize a session object
  * @se_sess: Session object pointer.
@@ -237,7 +244,8 @@ int transport_init_session(struct se_session *se_sess)
 	INIT_LIST_HEAD(&se_sess->sess_cmd_list);
 	spin_lock_init(&se_sess->sess_cmd_lock);
 	init_waitqueue_head(&se_sess->cmd_list_wq);
-	return 0;
+	return percpu_ref_init(&se_sess->cmd_count,
+			       target_release_sess_cmd_refcnt, 0, GFP_KERNEL);
 }
 EXPORT_SYMBOL(transport_init_session);
 
@@ -587,6 +595,7 @@ void transport_free_session(struct se_session *se_sess)
 		sbitmap_queue_free(&se_sess->sess_tag_pool);
 		kvfree(se_sess->sess_cmd_map);
 	}
+	percpu_ref_exit(&se_sess->cmd_count);
 	kmem_cache_free(se_sess_cache, se_sess);
 }
 EXPORT_SYMBOL(transport_free_session);
@@ -2730,6 +2739,7 @@ int target_get_sess_cmd(struct se_cmd *se_cmd, bool ack_kref)
 	}
 	se_cmd->transport_state |= CMD_T_PRE_EXECUTE;
 	list_add_tail(&se_cmd->se_cmd_list, &se_sess->sess_cmd_list);
+	percpu_ref_get(&se_sess->cmd_count);
 out:
 	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
 
@@ -2760,8 +2770,6 @@ static void target_release_cmd_kref(struct kref *kref)
 	if (se_sess) {
 		spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
 		list_del_init(&se_cmd->se_cmd_list);
-		if (list_empty(&se_sess->sess_cmd_list))
-			wake_up(&se_sess->cmd_list_wq);
 		spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
 	}
 
@@ -2769,6 +2777,8 @@ static void target_release_cmd_kref(struct kref *kref)
 	se_cmd->se_tfo->release_cmd(se_cmd);
 	if (compl)
 		complete(compl);
+
+	percpu_ref_put(&se_sess->cmd_count);
 }
 
 /**
@@ -2897,6 +2907,8 @@ void target_sess_cmd_list_set_waiting(struct se_session *se_sess)
 	spin_lock_irqsave(&se_sess->sess_cmd_lock, flags);
 	se_sess->sess_tearing_down = 1;
 	spin_unlock_irqrestore(&se_sess->sess_cmd_lock, flags);
+
+	percpu_ref_kill_and_confirm(&se_sess->cmd_count, NULL);
 }
 EXPORT_SYMBOL(target_sess_cmd_list_set_waiting);
 
@@ -2911,17 +2923,14 @@ void target_wait_for_sess_cmds(struct se_session *se_sess)
 
 	WARN_ON_ONCE(!se_sess->sess_tearing_down);
 
-	spin_lock_irq(&se_sess->sess_cmd_lock);
 	do {
-		ret = wait_event_interruptible_lock_irq_timeout(
-				se_sess->cmd_list_wq,
-				list_empty(&se_sess->sess_cmd_list),
-				se_sess->sess_cmd_lock, 180 * HZ);
+		ret = wait_event_timeout(se_sess->cmd_list_wq,
+				percpu_ref_is_zero(&se_sess->cmd_count),
+				180 * HZ);
 		list_for_each_entry(cmd, &se_sess->sess_cmd_list, se_cmd_list)
 			target_show_cmd("session shutdown: still waiting for ",
 					cmd);
 	} while (ret <= 0);
-	spin_unlock_irq(&se_sess->sess_cmd_lock);
 }
 EXPORT_SYMBOL(target_wait_for_sess_cmds);
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index 7a4ee7852ca4..2cfd3b4573b0 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -602,6 +602,7 @@ struct se_session {
 	struct se_node_acl	*se_node_acl;
 	struct se_portal_group	*se_tpg;
 	void			*fabric_sess_ptr;
+	struct percpu_ref	cmd_count;
 	struct list_head	sess_list;
 	struct list_head	sess_acl_list;
 	struct list_head	sess_cmd_list;
A session must only be released after all code that accesses the session
structure has finished. Make sure that this is the case by introducing a
new command counter per session that is only decremented after the
.release_cmd() callback has finished.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
---
 drivers/target/target_core_transport.c | 27 +++++++++++++++++---------
 include/target/target_core_base.h      |  1 +
 2 files changed, 19 insertions(+), 9 deletions(-)