diff mbox

scsi: ata: don't reset three times if device is offline for SAS host

Message ID 1516800025-193614-1-git-send-email-chenxiang66@hisilicon.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

chenxiang Jan. 24, 2018, 1:20 p.m. UTC
In ata_eh_reset, it will reset three times at most for sata disk. For
some drivers through libsas, it calls sas_ata_hard_reset at last. When
device is gone, function sas_ata_hard_reset will return -ENODEV. But
it will still try to reset three times for offline device. This process
lasts a long time:

[11248.344323] ata13.00: status: { ERR }
[11248.344324] ata13.00: error: { ABRT }
[11248.344327] ata13: hard resetting link
[11248.503557] sas: ata: ex 500e004aaaaaaa1f phy02:U:A attached:0000000000000000 (no device)
[11249.359524] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta03d:19h:35m:17s]
[11249.365692] ata13: reset failed (errno=-19), retrying in 9 secs
[11258.451402] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops][eta 03d:22h:10m:48s]
[11259.411508] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 03d:22h:28m:05s]
[11259.417683] ata13: reset failed (errno=-19), retrying in 10 secs
[11268.695401] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops] [eta 04d:01h:03m:37s]
[11269.699513] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 04d:01h:20m:54s]
[11269.705689] ata13: reset failed (errno=-19), retrying in 34 secs
[11304.275393] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops] [eta 04d:11h:25m:43s]
[11305.283516] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 04d:11h:43m:00s]
[11305.289692] ata13: reset failed, giving up
[11305.293785] ata13.00: disabled

Actually it is no need to reset three times for this scenario. So add
a check to avoid it.

Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
---
 drivers/ata/libata-eh.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Tejun Heo Feb. 12, 2018, 4:51 p.m. UTC | #1
Hello,

On Wed, Jan 24, 2018 at 09:20:25PM +0800, chenxiang wrote:
> In ata_eh_reset, it will reset three times at most for sata disk. For
> some drivers through libsas, it calls sas_ata_hard_reset at last. When
> device is gone, function sas_ata_hard_reset will return -ENODEV. But
> it will still try to reset three times for offline device. This process
> lasts a long time:
> 
> [11248.344323] ata13.00: status: { ERR }
> [11248.344324] ata13.00: error: { ABRT }
> [11248.344327] ata13: hard resetting link
> [11248.503557] sas: ata: ex 500e004aaaaaaa1f phy02:U:A attached:0000000000000000 (no device)
> [11249.359524] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta03d:19h:35m:17s]
> [11249.365692] ata13: reset failed (errno=-19), retrying in 9 secs
> [11258.451402] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops][eta 03d:22h:10m:48s]
> [11259.411508] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 03d:22h:28m:05s]
> [11259.417683] ata13: reset failed (errno=-19), retrying in 10 secs
> [11268.695401] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops] [eta 04d:01h:03m:37s]
> [11269.699513] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 04d:01h:20m:54s]
> [11269.705689] ata13: reset failed (errno=-19), retrying in 34 secs
> [11304.275393] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops] [eta 04d:11h:25m:43s]
> [11305.283516] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 04d:11h:43m:00s]
> [11305.289692] ata13: reset failed, giving up
> [11305.293785] ata13.00: disabled
> 
> Actually it is no need to reset three times for this scenario. So add
> a check to avoid it.

I'm a bit reluctant in changing this per-driver.  Does this actually
hurt something?

Thanks.
chenxiang Feb. 13, 2018, 1:44 a.m. UTC | #2
Hi Tejun,

在 2018/2/13 0:51, Tejun Heo 写道:
> Hello,
>
> On Wed, Jan 24, 2018 at 09:20:25PM +0800, chenxiang wrote:
>> In ata_eh_reset, it will reset three times at most for sata disk. For
>> some drivers through libsas, it calls sas_ata_hard_reset at last. When
>> device is gone, function sas_ata_hard_reset will return -ENODEV. But
>> it will still try to reset three times for offline device. This process
>> lasts a long time:
>>
>> [11248.344323] ata13.00: status: { ERR }
>> [11248.344324] ata13.00: error: { ABRT }
>> [11248.344327] ata13: hard resetting link
>> [11248.503557] sas: ata: ex 500e004aaaaaaa1f phy02:U:A attached:0000000000000000 (no device)
>> [11249.359524] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta03d:19h:35m:17s]
>> [11249.365692] ata13: reset failed (errno=-19), retrying in 9 secs
>> [11258.451402] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops][eta 03d:22h:10m:48s]
>> [11259.411508] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 03d:22h:28m:05s]
>> [11259.417683] ata13: reset failed (errno=-19), retrying in 10 secs
>> [11268.695401] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops] [eta 04d:01h:03m:37s]
>> [11269.699513] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 04d:01h:20m:54s]
>> [11269.705689] ata13: reset failed (errno=-19), retrying in 34 secs
>> [11304.275393] ata13: hard resetting linkKB/0KB/0KB /s] [0/0/0 iops] [eta 04d:11h:25m:43s]
>> [11305.283516] sas: ata13: end_device-1:0:2: reset failed (errno=-19)[eta 04d:11h:43m:00s]
>> [11305.289692] ata13: reset failed, giving up
>> [11305.293785] ata13.00: disabled
>>
>> Actually it is no need to reset three times for this scenario. So add
>> a check to avoid it.
> I'm a bit reluctant in changing this per-driver.  Does this actually
> hurt something?

For those drivers using libsas,  i think they have the same issue. It 
takes about 1 minute to
recover but actually device is gone, so this recover is useless for this 
scenario (when enter EH,
all normal IOs are blocked actually, so it will cause normal IOs are 
blocked one more minute which
user doesn't want to).
Actually in sas_ata_hard_reset, there are two situations returned 
-ENODEV which represent device is gone:
- LLDD directly returns -ENODEV through lldd_I_T_nexus_reset;
- It sends SMP DISCOVER to check local phy in smp_ata_check_ready, and 
find it is gone;

> Thanks.
>
Tejun Heo Feb. 13, 2018, 2:27 p.m. UTC | #3
Hello,

On Tue, Feb 13, 2018 at 09:44:53AM +0800, chenxiang (M) wrote:
> For those drivers using libsas,  i think they have the same issue.
> It takes about 1 minute to
> recover but actually device is gone, so this recover is useless for
> this scenario (when enter EH,
> all normal IOs are blocked actually, so it will cause normal IOs are
> blocked one more minute which
> user doesn't want to).

Right, it'd block other devices sharing the port.  Doesn't sas map
each ata device to its own port tho?

> Actually in sas_ata_hard_reset, there are two situations returned
> -ENODEV which represent device is gone:
> - LLDD directly returns -ENODEV through lldd_I_T_nexus_reset;
> - It sends SMP DISCOVER to check local phy in smp_ata_check_ready,
> and find it is gone;

So, if there are real consequences, we can definitely add a way to
short-circuit the recovery logic but let's do that by adding proper
signaling rathr than testing for driver type.

Thanks.
chenxiang Feb. 26, 2018, 11:45 a.m. UTC | #4
Hi Tejun,
Sorry for my late reply as i have a vacation last week.

在 2018/2/13 22:27, Tejun Heo 写道:
> Hello,
>
> On Tue, Feb 13, 2018 at 09:44:53AM +0800, chenxiang (M) wrote:
>> For those drivers using libsas,  i think they have the same issue.
>> It takes about 1 minute to
>> recover but actually device is gone, so this recover is useless for
>> this scenario (when enter EH,
>> all normal IOs are blocked actually, so it will cause normal IOs are
>> blocked one more minute which
>> user doesn't want to).
> Right, it'd block other devices sharing the port.  Doesn't sas map
> each ata device to its own port tho?

Yes, for ata devices connected with expander, all the devices share the 
same port.

>
>> Actually in sas_ata_hard_reset, there are two situations returned
>> -ENODEV which represent device is gone:
>> - LLDD directly returns -ENODEV through lldd_I_T_nexus_reset;
>> - It sends SMP DISCOVER to check local phy in smp_ata_check_ready,
>> and find it is gone;
> So, if there are real consequences, we can definitely add a way to
> short-circuit the recovery logic but let's do that by adding proper
> signaling rathr than testing for driver type.

I am not familiar with ata recovery logic, and do you have idea about 
how to add a way to
short-circuit the recovery logic by adding proper signaling?

>
> Thanks.
>
Tejun Heo Feb. 27, 2018, 6:19 p.m. UTC | #5
Hello,

On Mon, Feb 26, 2018 at 07:45:37PM +0800, chenxiang (M) wrote:
> >So, if there are real consequences, we can definitely add a way to
> >short-circuit the recovery logic but let's do that by adding proper
> >signaling rathr than testing for driver type.
> 
> I am not familiar with ata recovery logic, and do you have idea
> about how to add a way to
> short-circuit the recovery logic by adding proper signaling?

e.g. define a return value from reset function which indicates no
retry or introduce a port flag.  Basically, something which doens't
add special casing logic to the core logic.

Thanks.
chenxiang Feb. 28, 2018, 7:18 a.m. UTC | #6
Hi Tejun,

在 2018/2/28 2:19, Tejun Heo 写道:
> Hello,
>
> On Mon, Feb 26, 2018 at 07:45:37PM +0800, chenxiang (M) wrote:
>>> So, if there are real consequences, we can definitely add a way to
>>> short-circuit the recovery logic but let's do that by adding proper
>>> signaling rathr than testing for driver type.
>> I am not familiar with ata recovery logic, and do you have idea
>> about how to add a way to
>> short-circuit the recovery logic by adding proper signaling?
> e.g. define a return value from reset function which indicates no
> retry or introduce a port flag.  Basically, something which doens't
> add special casing logic to the core logic.

If we can introduce a port flags such as ATA_LFLAG_DISABLED or 
ATA_EHI_NO_RECOVERY before ata_eh_recover,
it will skip recovery. But we only get device status NODEV from 
ata_do_reset which is after ata_eh_recover, i don't
think it is used to introduce a port flags at that time.
 From function ata_eh_reset, it seems there are two situations that end 
the recovery logic in current code:
1. Retry 3 times;
2. Reset function return value -ERESTART, but this return value seems be 
specal situation for ATA;


>
> Thanks.
>
Tejun Heo March 1, 2018, 9:57 p.m. UTC | #7
Hello,

On Wed, Feb 28, 2018 at 03:18:39PM +0800, chenxiang (M) wrote:
> If we can introduce a port flags such as ATA_LFLAG_DISABLED or
> ATA_EHI_NO_RECOVERY before ata_eh_recover,
> it will skip recovery. But we only get device status NODEV from
> ata_do_reset which is after ata_eh_recover, i don't
> think it is used to introduce a port flags at that time.
> From function ata_eh_reset, it seems there are two situations that
> end the recovery logic in current code:
> 1. Retry 3 times;
> 2. Reset function return value -ERESTART, but this return value
> seems be specal situation for ATA;

So, we have -ERESTART for restarting, -EPIPE for speeding down.  We
can just add another special return code - say, -ENOENT - to indicate
that there's nothing and it shouldn't keep trying.

Thanks.
diff mbox

Patch

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 11c3137..23a8946 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -3032,6 +3032,15 @@  int ata_eh_reset(struct ata_link *link, int classify,
 		goto out;
 	}
 
+	if (rc == -ENODEV && ap->flags & ATA_FLAG_SAS_HOST) {
+		ata_link_warn(failed_link,
+			"reset failed (errno=%d, device is offline for SAS host\n)",
+			rc);
+		if (ata_is_host_link(link))
+			ata_eh_thaw_port(ap);
+		goto out;
+	}
+
 	now = jiffies;
 	if (time_before(now, deadline)) {
 		unsigned long delta = deadline - now;