Message ID | 20230129234441.116310-8-michael.christie@oracle.com (mailing list archive) |
---|---|
State | Deferred |
Headers | show |
Series | target: TMF and recovery fixes | expand |
On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: > > This fixes a bug where an initiator thinks a LUN_RESET has cleaned > up running commands when it hasn't. The bug was added in: > > commit 51ec502a3266 ("target: Delete tmr from list before processing") > > The problem occurs when: > > 1. We have N IO cmds running in the target layer spread over 2 sessions. > 2. The initiator sends a LUN_RESET for each session. > 3. session1's LUN_RESET loops over all the running commands from both > sessions and moves them to its local drain_task_list. > 4. session2's LUN_RESET does not see the LUN_RESET from session1 because > the commit above has it remove itself. session2 also does not see any > commands since the other reset moved them off the state lists. > 5. sessions2's LUN_RESET will then complete with a successful response. > 6. sessions2's inititor believes the running commands on its session are > now cleaned up due to the successful response and cleans up the running > commands from its side. It then restarts them. > 7. The commands do eventually complete on the backend and the target > starts to return aborted task statuses for them. The initiator will > either throw a invalid ITT error or might accidentally lookup a new task > if the ITT has been reallocated already. > > This fixes the bug by reverting the patch. > > Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") > Signed-off-by: Mike Christie <michael.christie@oracle.com> > Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Actually, this patch even fixes a crash that we've just faced. The second LUN_RESET moves the first LUN_RESET from tmr_list to its drain_tmr_list, then the first LUN_RESET removes itself from second`s drain_tmr_list, then the second LUN_RESET tries to remove the first from the list and crashes because it was deleted already. So, Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
On Sat, Feb 11, 2023 at 11:59:22AM +0300, Dmitry Bogdanov wrote: > On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: > > > > This fixes a bug where an initiator thinks a LUN_RESET has cleaned > > up running commands when it hasn't. The bug was added in: > > > > commit 51ec502a3266 ("target: Delete tmr from list before processing") > > > > The problem occurs when: > > > > 1. We have N IO cmds running in the target layer spread over 2 sessions. > > 2. The initiator sends a LUN_RESET for each session. > > 3. session1's LUN_RESET loops over all the running commands from both > > sessions and moves them to its local drain_task_list. > > 4. session2's LUN_RESET does not see the LUN_RESET from session1 because > > the commit above has it remove itself. session2 also does not see any > > commands since the other reset moved them off the state lists. > > 5. sessions2's LUN_RESET will then complete with a successful response. > > 6. sessions2's inititor believes the running commands on its session are > > now cleaned up due to the successful response and cleans up the running > > commands from its side. It then restarts them. > > 7. The commands do eventually complete on the backend and the target > > starts to return aborted task statuses for them. The initiator will > > either throw a invalid ITT error or might accidentally lookup a new task > > if the ITT has been reallocated already. > > > > This fixes the bug by reverting the patch. > > > > Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") > > Signed-off-by: Mike Christie <michael.christie@oracle.com> > > Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> > > Actually, this patch even fixes a crash that we've just faced. > The second LUN_RESET moves the first LUN_RESET from tmr_list to its > drain_tmr_list, then the first LUN_RESET removes itself from second`s > drain_tmr_list, then the second LUN_RESET tries to remove the first from > the list and crashes because it was deleted already. > So, > > Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Unfortunately, I am revoking my tags. This patch leads to deadlock of two LUN_RESETs waiting for each other in its drain_tmr_list. To keep LUN_RESETs ignoring each other something like that is needed: > --- > drivers/target/target_core_tmr.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c > index 2b95b4550a63..a60802b4c5a3 100644 > --- a/drivers/target/target_core_tmr.c > +++ b/drivers/target/target_core_tmr.c > @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( > * LUN_RESET tmr.. > */ > spin_lock_irqsave(&dev->se_tmr_lock, flags); > - if (tmr) > - list_del_init(&tmr->tmr_list); > list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { - > + if (tmr_p == tmr) - > + continue; - > + + /* Ignore LUN_RESETs to avoid deadlocks */ + if (tmr_p->function == TMR_LUN_RESET) + continue; + > cmd = tmr_p->task_cmd; > if (!cmd) { > pr_err("Unable to locate struct se_cmd for TMR\n"); > -- > 2.25.1
On 3/2/23 3:43 AM, Dmitry Bogdanov wrote: > On Sat, Feb 11, 2023 at 11:59:22AM +0300, Dmitry Bogdanov wrote: >> On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: >>> >>> This fixes a bug where an initiator thinks a LUN_RESET has cleaned >>> up running commands when it hasn't. The bug was added in: >>> >>> commit 51ec502a3266 ("target: Delete tmr from list before processing") >>> >>> The problem occurs when: >>> >>> 1. We have N IO cmds running in the target layer spread over 2 sessions. >>> 2. The initiator sends a LUN_RESET for each session. >>> 3. session1's LUN_RESET loops over all the running commands from both >>> sessions and moves them to its local drain_task_list. >>> 4. session2's LUN_RESET does not see the LUN_RESET from session1 because >>> the commit above has it remove itself. session2 also does not see any >>> commands since the other reset moved them off the state lists. >>> 5. sessions2's LUN_RESET will then complete with a successful response. >>> 6. sessions2's inititor believes the running commands on its session are >>> now cleaned up due to the successful response and cleans up the running >>> commands from its side. It then restarts them. >>> 7. The commands do eventually complete on the backend and the target >>> starts to return aborted task statuses for them. The initiator will >>> either throw a invalid ITT error or might accidentally lookup a new task >>> if the ITT has been reallocated already. >>> >>> This fixes the bug by reverting the patch. >>> >>> Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") >>> Signed-off-by: Mike Christie <michael.christie@oracle.com> >>> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> >> >> Actually, this patch even fixes a crash that we've just faced. >> The second LUN_RESET moves the first LUN_RESET from tmr_list to its >> drain_tmr_list, then the first LUN_RESET removes itself from second`s >> drain_tmr_list, then the second LUN_RESET tries to remove the first from >> the list and crashes because it was deleted already. >> So, >> >> Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com> > > Unfortunately, I am revoking my tags. This patch leads to deadlock of two > LUN_RESETs waiting for each other in its drain_tmr_list. > > To keep LUN_RESETs ignoring each other something like that is needed: >> --- >> drivers/target/target_core_tmr.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c >> index 2b95b4550a63..a60802b4c5a3 100644 >> --- a/drivers/target/target_core_tmr.c >> +++ b/drivers/target/target_core_tmr.c >> @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( >> * LUN_RESET tmr.. >> */ >> spin_lock_irqsave(&dev->se_tmr_lock, flags); >> - if (tmr) >> - list_del_init(&tmr->tmr_list); >> list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { > - > + if (tmr_p == tmr) > - > + continue; > - > + > > + /* Ignore LUN_RESETs to avoid deadlocks */ > + if (tmr_p->function == TMR_LUN_RESET) > + continue; > + Shoot, that adds back the bug I was hitting. I have an idea for how to fix both issues, but let me do some more testing and will repost the set. Thanks for testing and reviewing this.
On Thu, Mar 02, 2023 at 11:29:50AM -0600, Mike Christie wrote: > > On 3/2/23 3:43 AM, Dmitry Bogdanov wrote: > > On Sat, Feb 11, 2023 at 11:59:22AM +0300, Dmitry Bogdanov wrote: > >> On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: > >>> > >>> This fixes a bug where an initiator thinks a LUN_RESET has cleaned > >>> up running commands when it hasn't. The bug was added in: > >>> > >>> commit 51ec502a3266 ("target: Delete tmr from list before processing") > >>> > >>> The problem occurs when: > >>> > >>> 1. We have N IO cmds running in the target layer spread over 2 sessions. > >>> 2. The initiator sends a LUN_RESET for each session. > >>> 3. session1's LUN_RESET loops over all the running commands from both > >>> sessions and moves them to its local drain_task_list. > >>> 4. session2's LUN_RESET does not see the LUN_RESET from session1 because > >>> the commit above has it remove itself. session2 also does not see any > >>> commands since the other reset moved them off the state lists. > >>> 5. sessions2's LUN_RESET will then complete with a successful response. > >>> 6. sessions2's inititor believes the running commands on its session are > >>> now cleaned up due to the successful response and cleans up the running > >>> commands from its side. It then restarts them. > >>> 7. The commands do eventually complete on the backend and the target > >>> starts to return aborted task statuses for them. The initiator will > >>> either throw a invalid ITT error or might accidentally lookup a new task > >>> if the ITT has been reallocated already. > >>> > >>> This fixes the bug by reverting the patch. > >>> > >>> Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") > >>> Signed-off-by: Mike Christie <michael.christie@oracle.com> > >>> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> > >> > >> Actually, this patch even fixes a crash that we've just faced. > >> The second LUN_RESET moves the first LUN_RESET from tmr_list to its > >> drain_tmr_list, then the first LUN_RESET removes itself from second`s > >> drain_tmr_list, then the second LUN_RESET tries to remove the first from > >> the list and crashes because it was deleted already. > >> So, > >> > >> Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com> > > > > Unfortunately, I am revoking my tags. This patch leads to deadlock of two > > LUN_RESETs waiting for each other in its drain_tmr_list. > > > > To keep LUN_RESETs ignoring each other something like that is needed: > >> --- > >> drivers/target/target_core_tmr.c | 5 +++-- > >> 1 file changed, 3 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c > >> index 2b95b4550a63..a60802b4c5a3 100644 > >> --- a/drivers/target/target_core_tmr.c > >> +++ b/drivers/target/target_core_tmr.c > >> @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( > >> * LUN_RESET tmr.. > >> */ > >> spin_lock_irqsave(&dev->se_tmr_lock, flags); > >> - if (tmr) > >> - list_del_init(&tmr->tmr_list); > >> list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { > > - > + if (tmr_p == tmr) > > - > + continue; > > - > + > > > > + /* Ignore LUN_RESETs to avoid deadlocks */ > > + if (tmr_p->function == TMR_LUN_RESET) > > + continue; > > + > > Shoot, that adds back the bug I was hitting. Haha, exactly. > I have an idea for how to fix both issues, but let me do some more testing > and will repost the set. Thanks for testing and reviewing this. scst serializes LUN_RESET execution, may be we should do the same.
On 3/3/23 3:26 AM, Dmitry Bogdanov wrote: > On Thu, Mar 02, 2023 at 11:29:50AM -0600, Mike Christie wrote: >> >> On 3/2/23 3:43 AM, Dmitry Bogdanov wrote: >>> On Sat, Feb 11, 2023 at 11:59:22AM +0300, Dmitry Bogdanov wrote: >>>> On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: >>>>> >>>>> This fixes a bug where an initiator thinks a LUN_RESET has cleaned >>>>> up running commands when it hasn't. The bug was added in: >>>>> >>>>> commit 51ec502a3266 ("target: Delete tmr from list before processing") >>>>> >>>>> The problem occurs when: >>>>> >>>>> 1. We have N IO cmds running in the target layer spread over 2 sessions. >>>>> 2. The initiator sends a LUN_RESET for each session. >>>>> 3. session1's LUN_RESET loops over all the running commands from both >>>>> sessions and moves them to its local drain_task_list. >>>>> 4. session2's LUN_RESET does not see the LUN_RESET from session1 because >>>>> the commit above has it remove itself. session2 also does not see any >>>>> commands since the other reset moved them off the state lists. >>>>> 5. sessions2's LUN_RESET will then complete with a successful response. >>>>> 6. sessions2's inititor believes the running commands on its session are >>>>> now cleaned up due to the successful response and cleans up the running >>>>> commands from its side. It then restarts them. >>>>> 7. The commands do eventually complete on the backend and the target >>>>> starts to return aborted task statuses for them. The initiator will >>>>> either throw a invalid ITT error or might accidentally lookup a new task >>>>> if the ITT has been reallocated already. >>>>> >>>>> This fixes the bug by reverting the patch. >>>>> >>>>> Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") >>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com> >>>>> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> >>>> >>>> Actually, this patch even fixes a crash that we've just faced. >>>> The second LUN_RESET moves the first LUN_RESET from tmr_list to its >>>> drain_tmr_list, then the first LUN_RESET removes itself from second`s >>>> drain_tmr_list, then the second LUN_RESET tries to remove the first from >>>> the list and crashes because it was deleted already. >>>> So, >>>> >>>> Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com> >>> >>> Unfortunately, I am revoking my tags. This patch leads to deadlock of two >>> LUN_RESETs waiting for each other in its drain_tmr_list. >>> >>> To keep LUN_RESETs ignoring each other something like that is needed: >>>> --- >>>> drivers/target/target_core_tmr.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c >>>> index 2b95b4550a63..a60802b4c5a3 100644 >>>> --- a/drivers/target/target_core_tmr.c >>>> +++ b/drivers/target/target_core_tmr.c >>>> @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( >>>> * LUN_RESET tmr.. >>>> */ >>>> spin_lock_irqsave(&dev->se_tmr_lock, flags); >>>> - if (tmr) >>>> - list_del_init(&tmr->tmr_list); >>>> list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { >>> - > + if (tmr_p == tmr) >>> - > + continue; >>> - > + >>> >>> + /* Ignore LUN_RESETs to avoid deadlocks */ >>> + if (tmr_p->function == TMR_LUN_RESET) >>> + continue; >>> + >> >> Shoot, that adds back the bug I was hitting. > > Haha, exactly. > >> I have an idea for how to fix both issues, but let me do some more testing >> and will repost the set. Thanks for testing and reviewing this. > > scst serializes LUN_RESET execution, may be we should do the same. > That's what I'm working on. I was thinking async aborts are nice for ESX where you have N VMs on a LUN and you don't want VM1's abort to have to wait for VM2's abort so left that. However, for LUN_RESET, it has to wait for all the commands, so it's not as bad to serialize them.
diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c index 2b95b4550a63..a60802b4c5a3 100644 --- a/drivers/target/target_core_tmr.c +++ b/drivers/target/target_core_tmr.c @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( * LUN_RESET tmr.. */ spin_lock_irqsave(&dev->se_tmr_lock, flags); - if (tmr) - list_del_init(&tmr->tmr_list); list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { + if (tmr_p == tmr) + continue; + cmd = tmr_p->task_cmd; if (!cmd) { pr_err("Unable to locate struct se_cmd for TMR\n");