Message ID | 20221208031002.106700-6-michael.christie@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | scsi target task management fixes | expand |
On Wed, Dec 07, 2022 at 09:10:00PM -0600, Mike Christie wrote: > > This fixes 2 bugs added in: > > commit f36199355c64 ("scsi: target: iscsi: Fix cmd abort fabric stop > race") > > If we have multiple sessions to the same se_device we can hit a race where > a LUN_RESET on one session cleans up the se_cmds from under another > session which is being closed. This results in the closing session freeing > its conn/session structs while they are still in use. > > The bug is: > > 1. Session1 has IO se_cmd1. > 2. Session2 can also have se_cmds for IO and optionally TMRs for ABORTS > but then gets a LUN_RESET. > 3. The LUN_RESET on session2 sees the se_cmds on session1 and during > the drain stages marks them all with CMD_T_ABORTED. > 4. session1 is now closed so iscsit_release_commands_from_conn only sees > se_cmds with the CMD_T_ABORTED bit set and returns immediately even > though we have outstanding commands. > 5. session1's connection and session are freed. > 6. The backend request for se_cmd1 completes and it accesses the freed > connection/session. > > If session1 was executing only IO se_cmds and TAS is set on the se_cmd, > then we need to do a iscsit_free_cmd on those commands, so we wait on > their completion from LIO core and the backend. > > If session1 was waiting on tmr se_cmds or TAS is not set then we need to > wait for those outstanding se_cmds to have their last put done so we > know no user is still accessing them when we free the session/conn. > > This fixes the TAS set case, by adding a check so if we hit it we now call > iscsit_free_cmd. To handle the tmr se_cd and non TAS case, it hooks the > iscsit layer into the cmd counter code, so we can wait for all outstanding > commands before freeing the connection and possibly the session. > > Fixes: f36199355c64 ("scsi: target: iscsi: Fix cmd abort fabric stop race") > Signed-off-by: Mike Christie <michael.christie@oracle.com> > --- > drivers/infiniband/ulp/isert/ib_isert.c | 13 +------------ > drivers/target/iscsi/iscsi_target.c | 13 ++++++++++++- > drivers/target/target_core_transport.c | 6 ++++-- > include/target/target_core_fabric.h | 2 ++ > 4 files changed, 19 insertions(+), 15 deletions(-) > > diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c > index b360a1527cd1..600059d8a3a7 100644 > --- a/drivers/infiniband/ulp/isert/ib_isert.c > +++ b/drivers/infiniband/ulp/isert/ib_isert.c > @@ -2501,17 +2501,6 @@ isert_wait4logout(struct isert_conn *isert_conn) > } > } > > -static void > -isert_wait4cmds(struct iscsit_conn *conn) > -{ > - isert_info("iscsit_conn %p\n", conn); > - > - if (conn->sess) { > - target_stop_session(conn->sess->se_sess); > - target_wait_for_sess_cmds(conn->sess->se_sess); > - } > -} > - > /** > * isert_put_unsol_pending_cmds() - Drop commands waiting for > * unsolicitate dataout > @@ -2559,7 +2548,7 @@ static void isert_wait_conn(struct iscsit_conn *conn) > > ib_drain_qp(isert_conn->qp); > isert_put_unsol_pending_cmds(conn); > - isert_wait4cmds(conn); > + target_wait_for_cmds(conn->cmd_cnt); > isert_wait4logout(isert_conn); > > queue_work(isert_release_wq, &isert_conn->release_work); > diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c > index 7a8ffdf33bee..1c3470e4b50c 100644 > --- a/drivers/target/iscsi/iscsi_target.c > +++ b/drivers/target/iscsi/iscsi_target.c > @@ -4221,7 +4221,8 @@ static void iscsit_release_commands_from_conn(struct iscsit_conn *conn) > > if (se_cmd->se_tfo != NULL) { > spin_lock_irq(&se_cmd->t_state_lock); > - if (se_cmd->transport_state & CMD_T_ABORTED) { > + if (se_cmd->transport_state & CMD_T_ABORTED && > + !(se_cmd->transport_state & CMD_T_TAS)) { > /* > * LIO's abort path owns the cleanup for this, > * so put it back on the list and let Could you please extract ths snippet (fix of the hanged commands with TAS) to a separate patch? It looks good. > @@ -4244,6 +4245,14 @@ static void iscsit_release_commands_from_conn(struct iscsit_conn *conn) > iscsit_free_cmd(cmd, true); > > } > + > + /* > + * Wait on commands that were cleaned up via the aborted_task path. > + * LLDs that implement iscsit_wait_conn will already have waited for > + * commands. > + */ > + if (!conn->conn_transport->iscsit_wait_conn) > + target_wait_for_cmds(conn->cmd_cnt); > } > > static void iscsit_stop_timers_for_cmds( > @@ -4304,6 +4313,8 @@ int iscsit_close_connection( > iscsit_stop_nopin_response_timer(conn); > iscsit_stop_nopin_timer(conn); > > + target_stop_cmd_counter(conn->cmd_cnt); > + > if (conn->conn_transport->iscsit_wait_conn) > conn->conn_transport->iscsit_wait_conn(conn); I strongly believe that waiting for commands complete before decreasing the command refcounter is useless and leads to hangings. There was a several tries to wait for the commands complete in the session. But all of them were eventually reverted due to iSER [1]. [1] https://lore.kernel.org/all/CH2PR12MB4005D671F3D274C4D5FA0BAEDD1C0@CH2PR12MB4005.namprd12.prod.outlook.com/ Let's try it one more time - move conn->conn_transport->iscsit_wait_conn(conn) to the end of iscsit_release_commands_from_conn() to align iser with other iscsi transports. Probably, to have target_wait_for_cmds as a default .iscsit_wait_conn implementation would be the best way. > > diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c > index 90e3b1aef1f1..8bbf0c834b74 100644 > --- a/drivers/target/target_core_transport.c > +++ b/drivers/target/target_core_transport.c > @@ -3174,13 +3174,14 @@ static void target_stop_cmd_counter_confirm(struct percpu_ref *ref) > * target_stop_cmd_counter - Stop new IO from being added to the counter. > * @cmd_cnt: counter to stop > */ > -static void target_stop_cmd_counter(struct target_cmd_counter *cmd_cnt) > +void target_stop_cmd_counter(struct target_cmd_counter *cmd_cnt) > { > pr_debug("Stopping command counter.\n"); > if (!atomic_cmpxchg(&cmd_cnt->stopped, 0, 1)) > percpu_ref_kill_and_confirm(&cmd_cnt->refcnt, > target_stop_cmd_counter_confirm); > } > +EXPORT_SYMBOL_GPL(target_stop_cmd_counter); > > /** > * target_stop_session - Stop new IO from being queued on the session. > @@ -3196,7 +3197,7 @@ EXPORT_SYMBOL(target_stop_session); > * target_wait_for_cmds - Wait for outstanding cmds. > * @cmd_cnt: counter to wait for active I/O for. > */ > -static void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt) > +void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt) > { > int ret; > > @@ -3212,6 +3213,7 @@ static void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt) > wait_for_completion(&cmd_cnt->stop_done); > pr_debug("Waiting for cmds done.\n"); > } > +EXPORT_SYMBOL_GPL(target_wait_for_cmds); > > /** > * target_wait_for_sess_cmds - Wait for outstanding commands > diff --git a/include/target/target_core_fabric.h b/include/target/target_core_fabric.h > index 4cbfb532a431..b188b1e90e1e 100644 > --- a/include/target/target_core_fabric.h > +++ b/include/target/target_core_fabric.h > @@ -133,6 +133,8 @@ struct se_session *target_setup_session(struct se_portal_group *, > struct se_session *, void *)); > void target_remove_session(struct se_session *); > > +void target_stop_cmd_counter(struct target_cmd_counter *cmd_cnt); > +void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt); > struct target_cmd_counter *target_alloc_cmd_counter(void); > void target_free_cmd_counter(struct target_cmd_counter *cmd_cnt); > > -- > 2.25.1 > >
On 12/9/22 6:32 AM, Dmitry Bogdanov wrote: > On Wed, Dec 07, 2022 at 09:10:00PM -0600, Mike Christie wrote: >> >> This fixes 2 bugs added in: >> >> commit f36199355c64 ("scsi: target: iscsi: Fix cmd abort fabric stop >> race") >> >> If we have multiple sessions to the same se_device we can hit a race where >> a LUN_RESET on one session cleans up the se_cmds from under another >> session which is being closed. This results in the closing session freeing >> its conn/session structs while they are still in use. >> >> The bug is: >> >> 1. Session1 has IO se_cmd1. >> 2. Session2 can also have se_cmds for IO and optionally TMRs for ABORTS >> but then gets a LUN_RESET. >> 3. The LUN_RESET on session2 sees the se_cmds on session1 and during >> the drain stages marks them all with CMD_T_ABORTED. >> 4. session1 is now closed so iscsit_release_commands_from_conn only sees >> se_cmds with the CMD_T_ABORTED bit set and returns immediately even >> though we have outstanding commands. >> 5. session1's connection and session are freed. >> 6. The backend request for se_cmd1 completes and it accesses the freed >> connection/session. >> >> If session1 was executing only IO se_cmds and TAS is set on the se_cmd, >> then we need to do a iscsit_free_cmd on those commands, so we wait on >> their completion from LIO core and the backend. >> >> If session1 was waiting on tmr se_cmds or TAS is not set then we need to >> wait for those outstanding se_cmds to have their last put done so we >> know no user is still accessing them when we free the session/conn. >> >> This fixes the TAS set case, by adding a check so if we hit it we now call >> iscsit_free_cmd. To handle the tmr se_cd and non TAS case, it hooks the >> iscsit layer into the cmd counter code, so we can wait for all outstanding >> commands before freeing the connection and possibly the session. >> >> Fixes: f36199355c64 ("scsi: target: iscsi: Fix cmd abort fabric stop race") >> Signed-off-by: Mike Christie <michael.christie@oracle.com> >> --- >> drivers/infiniband/ulp/isert/ib_isert.c | 13 +------------ >> drivers/target/iscsi/iscsi_target.c | 13 ++++++++++++- >> drivers/target/target_core_transport.c | 6 ++++-- >> include/target/target_core_fabric.h | 2 ++ >> 4 files changed, 19 insertions(+), 15 deletions(-) >> >> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c >> index b360a1527cd1..600059d8a3a7 100644 >> --- a/drivers/infiniband/ulp/isert/ib_isert.c >> +++ b/drivers/infiniband/ulp/isert/ib_isert.c >> @@ -2501,17 +2501,6 @@ isert_wait4logout(struct isert_conn *isert_conn) >> } >> } >> >> -static void >> -isert_wait4cmds(struct iscsit_conn *conn) >> -{ >> - isert_info("iscsit_conn %p\n", conn); >> - >> - if (conn->sess) { >> - target_stop_session(conn->sess->se_sess); >> - target_wait_for_sess_cmds(conn->sess->se_sess); >> - } >> -} >> - >> /** >> * isert_put_unsol_pending_cmds() - Drop commands waiting for >> * unsolicitate dataout >> @@ -2559,7 +2548,7 @@ static void isert_wait_conn(struct iscsit_conn *conn) >> >> ib_drain_qp(isert_conn->qp); >> isert_put_unsol_pending_cmds(conn); >> - isert_wait4cmds(conn); >> + target_wait_for_cmds(conn->cmd_cnt); >> isert_wait4logout(isert_conn); >> >> queue_work(isert_release_wq, &isert_conn->release_work); >> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c >> index 7a8ffdf33bee..1c3470e4b50c 100644 >> --- a/drivers/target/iscsi/iscsi_target.c >> +++ b/drivers/target/iscsi/iscsi_target.c >> @@ -4221,7 +4221,8 @@ static void iscsit_release_commands_from_conn(struct iscsit_conn *conn) >> >> if (se_cmd->se_tfo != NULL) { >> spin_lock_irq(&se_cmd->t_state_lock); >> - if (se_cmd->transport_state & CMD_T_ABORTED) { >> + if (se_cmd->transport_state & CMD_T_ABORTED && >> + !(se_cmd->transport_state & CMD_T_TAS)) { >> /* >> * LIO's abort path owns the cleanup for this, >> * so put it back on the list and let > > Could you please extract ths snippet (fix of the hanged commands with > TAS) to a separate patch? It looks good. Yeah. > >> @@ -4244,6 +4245,14 @@ static void iscsit_release_commands_from_conn(struct iscsit_conn *conn) >> iscsit_free_cmd(cmd, true); >> >> } >> + >> + /* >> + * Wait on commands that were cleaned up via the aborted_task path. >> + * LLDs that implement iscsit_wait_conn will already have waited for >> + * commands. >> + */ >> + if (!conn->conn_transport->iscsit_wait_conn) >> + target_wait_for_cmds(conn->cmd_cnt); >> } >> >> static void iscsit_stop_timers_for_cmds( >> @@ -4304,6 +4313,8 @@ int iscsit_close_connection( >> iscsit_stop_nopin_response_timer(conn); >> iscsit_stop_nopin_timer(conn); >> >> + target_stop_cmd_counter(conn->cmd_cnt); >> + >> if (conn->conn_transport->iscsit_wait_conn) >> conn->conn_transport->iscsit_wait_conn(conn); > > I strongly believe that waiting for commands complete before decreasing > the command refcounter is useless and leads to hangings. > There was a several tries to wait for the commands complete in the > session. But all of them were eventually reverted due to iSER [1]. > [1] https://lore.kernel.org/all/CH2PR12MB4005D671F3D274C4D5FA0BAEDD1C0@CH2PR12MB4005.namprd12.prod.outlook.com/ > Yeah, I saw those. It's why we have the stop+wait split and why I left the isert target_wait_session where it was. > Let's try it one more time - move conn->conn_transport->iscsit_wait_conn(conn) > to the end of iscsit_release_commands_from_conn() to align iser with other > iscsi transports. > > Probably, to have target_wait_for_cmds as a default .iscsit_wait_conn > implementation would be the best way. > I think we have to do that in another patchset because this was meant to just fix the 2 regressions (that cleanup patch does not have conflicts so no need to backport it). What you want to do is not going to be as simple as just moving that call around so we can't sneak it in. The iscsit_wait_conn is where it is because it does 2 things: 1. The issue with Bart's patches was isert doesn't submit commands to lio from the iscsi rx thread so killing it or running iscsit_close_connection from it doesn't prevent new commands from coming in like it does for iscsi. For isert you have to disconnect the connection before cleaning up so we don't get new commands while/after cleaning up. 2. Because isert submits from the ib context we have to do a flush like call to make sure it's not still running similar to how for iscsi we kill the rx thread then do the cleanup from the tx thread (or the reverse depending on which thread starts the recovery). So the target_stop/wait_session did that for the driver. The current code works for iscsit because we send/recv from the tx/rx threads. If we recvd from the sk_data_ready callback then we would have a similar problem as isert. So we have to do: 1. Add a new callout close_connection which for isert does this initial disconnect cleanup and is called before iscsit_release_commands_from_conn. Bart's patch added a close callout but called it was called too late. 2. The target_stop_session calls acts as flush right now. When it returns we know the ib layer is not calling into the iscsi/lio layers anymore. We need something to replace this. 3. We can then do target_wait_for_cmds after iscsit_release_commands_from_conn for both drivers, but, we might need some more changes. See below. When we do iscsit iscsit_release_commands_from_conn we are: 1. Waiting on commands in the backend and LIO core. 2. Doing the last put on commands that have had queue_status called but we haven't freed the cmd because they haven't been ackd. Are we hitting an issue with #2? We need a proper bug and analysis or we are just guessing and am going to mess up. For example, for isert is the bug you are worried about that we have a missing isert_send_done/isert_completion_put call because we disconnected before the send callbacks could be done or because the ib layer won't call isert_send_done when it detects a failure? If so then yeah, the current code is going to hang since nothing is going to do the last put on the session since that's done later. To handle that we will have to move code around so we do isert_unmap_tx_desc from iscsit_unmap_cmd or do something so we don't leak those resources.
On 12/7/22 9:10 PM, Mike Christie wrote: > static void iscsit_stop_timers_for_cmds( > @@ -4304,6 +4313,8 @@ int iscsit_close_connection( > iscsit_stop_nopin_response_timer(conn); > iscsit_stop_nopin_timer(conn); > > + target_stop_cmd_counter(conn->cmd_cnt); > + > if (conn->conn_transport->iscsit_wait_conn) > conn->conn_transport->iscsit_wait_conn(conn); > Maurizo, don't test these patches. There is a bug where we have a missing target_stop_cmd_counter. If the login fails then we don't go through this path and will not do a stop. I'll send a updated patchset later.
On 12/10/22 12:48 PM, Mike Christie wrote: > > When we do iscsit iscsit_release_commands_from_conn we are: > > 1. Waiting on commands in the backend and LIO core. > 2. Doing the last put on commands that have had queue_status called but > we haven't freed the cmd because they haven't been ackd. > > Are we hitting an issue with #2? We need a proper bug and analysis or we are > just guessing and am going to mess up. > > For example, for isert is the bug you are worried about that we have a missing > isert_send_done/isert_completion_put call because we disconnected before the > send callbacks could be done or because the ib layer won't call isert_send_done > when it detects a failure? I tested this and it's actually opposite and broken for a different reason :) It looks like we will still call isert_send_done for the cases above so we are ok there. The target_wait_for_cmds call will also sync us up those calls as well. So if we move isert's target_wait_for_cmds we have to flush those calls as well or add some more checks/refcounts or something. It turns out instead of a hang there is use after free. We can race where isert_put_unsol_pending_cmds does a isert_put_cmd but then isert_send_done can be running and also does isert_completion_put -> isert_put_cmd, so we hit a use after free due to the isert_put_unsol_pending_cmds calls freeing the se_cmd.
On 12/10/22 7:38 PM, Mike Christie wrote: > On 12/10/22 12:48 PM, Mike Christie wrote: >> >> When we do iscsit iscsit_release_commands_from_conn we are: >> >> 1. Waiting on commands in the backend and LIO core. >> 2. Doing the last put on commands that have had queue_status called but >> we haven't freed the cmd because they haven't been ackd. >> >> Are we hitting an issue with #2? We need a proper bug and analysis or we are >> just guessing and am going to mess up. >> >> For example, for isert is the bug you are worried about that we have a missing >> isert_send_done/isert_completion_put call because we disconnected before the >> send callbacks could be done or because the ib layer won't call isert_send_done >> when it detects a failure? > I tested this and it's actually opposite and broken for a different reason :) > Hey Dimitry, it looks like you were right :) After I fixed the use after free with isert, I was able to test the case where there are TMRs running with isert and the connection closes and that was broken in multiple places. The updated patches end up merging the isert and iscsit command cleanup and wait code in the end like you requested.
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index b360a1527cd1..600059d8a3a7 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++ b/drivers/infiniband/ulp/isert/ib_isert.c @@ -2501,17 +2501,6 @@ isert_wait4logout(struct isert_conn *isert_conn) } } -static void -isert_wait4cmds(struct iscsit_conn *conn) -{ - isert_info("iscsit_conn %p\n", conn); - - if (conn->sess) { - target_stop_session(conn->sess->se_sess); - target_wait_for_sess_cmds(conn->sess->se_sess); - } -} - /** * isert_put_unsol_pending_cmds() - Drop commands waiting for * unsolicitate dataout @@ -2559,7 +2548,7 @@ static void isert_wait_conn(struct iscsit_conn *conn) ib_drain_qp(isert_conn->qp); isert_put_unsol_pending_cmds(conn); - isert_wait4cmds(conn); + target_wait_for_cmds(conn->cmd_cnt); isert_wait4logout(isert_conn); queue_work(isert_release_wq, &isert_conn->release_work); diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index 7a8ffdf33bee..1c3470e4b50c 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -4221,7 +4221,8 @@ static void iscsit_release_commands_from_conn(struct iscsit_conn *conn) if (se_cmd->se_tfo != NULL) { spin_lock_irq(&se_cmd->t_state_lock); - if (se_cmd->transport_state & CMD_T_ABORTED) { + if (se_cmd->transport_state & CMD_T_ABORTED && + !(se_cmd->transport_state & CMD_T_TAS)) { /* * LIO's abort path owns the cleanup for this, * so put it back on the list and let @@ -4244,6 +4245,14 @@ static void iscsit_release_commands_from_conn(struct iscsit_conn *conn) iscsit_free_cmd(cmd, true); } + + /* + * Wait on commands that were cleaned up via the aborted_task path. + * LLDs that implement iscsit_wait_conn will already have waited for + * commands. + */ + if (!conn->conn_transport->iscsit_wait_conn) + target_wait_for_cmds(conn->cmd_cnt); } static void iscsit_stop_timers_for_cmds( @@ -4304,6 +4313,8 @@ int iscsit_close_connection( iscsit_stop_nopin_response_timer(conn); iscsit_stop_nopin_timer(conn); + target_stop_cmd_counter(conn->cmd_cnt); + if (conn->conn_transport->iscsit_wait_conn) conn->conn_transport->iscsit_wait_conn(conn); diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 90e3b1aef1f1..8bbf0c834b74 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -3174,13 +3174,14 @@ static void target_stop_cmd_counter_confirm(struct percpu_ref *ref) * target_stop_cmd_counter - Stop new IO from being added to the counter. * @cmd_cnt: counter to stop */ -static void target_stop_cmd_counter(struct target_cmd_counter *cmd_cnt) +void target_stop_cmd_counter(struct target_cmd_counter *cmd_cnt) { pr_debug("Stopping command counter.\n"); if (!atomic_cmpxchg(&cmd_cnt->stopped, 0, 1)) percpu_ref_kill_and_confirm(&cmd_cnt->refcnt, target_stop_cmd_counter_confirm); } +EXPORT_SYMBOL_GPL(target_stop_cmd_counter); /** * target_stop_session - Stop new IO from being queued on the session. @@ -3196,7 +3197,7 @@ EXPORT_SYMBOL(target_stop_session); * target_wait_for_cmds - Wait for outstanding cmds. * @cmd_cnt: counter to wait for active I/O for. */ -static void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt) +void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt) { int ret; @@ -3212,6 +3213,7 @@ static void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt) wait_for_completion(&cmd_cnt->stop_done); pr_debug("Waiting for cmds done.\n"); } +EXPORT_SYMBOL_GPL(target_wait_for_cmds); /** * target_wait_for_sess_cmds - Wait for outstanding commands diff --git a/include/target/target_core_fabric.h b/include/target/target_core_fabric.h index 4cbfb532a431..b188b1e90e1e 100644 --- a/include/target/target_core_fabric.h +++ b/include/target/target_core_fabric.h @@ -133,6 +133,8 @@ struct se_session *target_setup_session(struct se_portal_group *, struct se_session *, void *)); void target_remove_session(struct se_session *); +void target_stop_cmd_counter(struct target_cmd_counter *cmd_cnt); +void target_wait_for_cmds(struct target_cmd_counter *cmd_cnt); struct target_cmd_counter *target_alloc_cmd_counter(void); void target_free_cmd_counter(struct target_cmd_counter *cmd_cnt);
This fixes 2 bugs added in: commit f36199355c64 ("scsi: target: iscsi: Fix cmd abort fabric stop race") If we have multiple sessions to the same se_device we can hit a race where a LUN_RESET on one session cleans up the se_cmds from under another session which is being closed. This results in the closing session freeing its conn/session structs while they are still in use. The bug is: 1. Session1 has IO se_cmd1. 2. Session2 can also have se_cmds for IO and optionally TMRs for ABORTS but then gets a LUN_RESET. 3. The LUN_RESET on session2 sees the se_cmds on session1 and during the drain stages marks them all with CMD_T_ABORTED. 4. session1 is now closed so iscsit_release_commands_from_conn only sees se_cmds with the CMD_T_ABORTED bit set and returns immediately even though we have outstanding commands. 5. session1's connection and session are freed. 6. The backend request for se_cmd1 completes and it accesses the freed connection/session. If session1 was executing only IO se_cmds and TAS is set on the se_cmd, then we need to do a iscsit_free_cmd on those commands, so we wait on their completion from LIO core and the backend. If session1 was waiting on tmr se_cmds or TAS is not set then we need to wait for those outstanding se_cmds to have their last put done so we know no user is still accessing them when we free the session/conn. This fixes the TAS set case, by adding a check so if we hit it we now call iscsit_free_cmd. To handle the tmr se_cd and non TAS case, it hooks the iscsit layer into the cmd counter code, so we can wait for all outstanding commands before freeing the connection and possibly the session. Fixes: f36199355c64 ("scsi: target: iscsi: Fix cmd abort fabric stop race") Signed-off-by: Mike Christie <michael.christie@oracle.com> --- drivers/infiniband/ulp/isert/ib_isert.c | 13 +------------ drivers/target/iscsi/iscsi_target.c | 13 ++++++++++++- drivers/target/target_core_transport.c | 6 ++++-- include/target/target_core_fabric.h | 2 ++ 4 files changed, 19 insertions(+), 15 deletions(-)