diff mbox series

[2/3] scsi: pm80xx: Do not issue hard reset before NCQ EH

Message ID 20240607175743.3986625-3-tadamsjr@google.com (mailing list archive)
State Superseded
Headers show
Series small pm80xx driver fixes | expand

Commit Message

TJ Adams June 7, 2024, 5:57 p.m. UTC
From: Igor Pylypiv <ipylypiv@google.com>

v6.2 commit 811be570a9a8 ("scsi: pm8001: Use sas_ata_device_link_abort()
to handle NCQ errors") removed duplicate NCQ EH from the pm80xx driver
and started relying on libata to handle the NCQ errors. The PM8006
controller has a special EH sequence that was added in v4.15 commit
869ddbdcae3b ("scsi: pm80xx: corrected SATA abort handling sequence.").
The special EH sequence issues a hard reset to a drive before libata EH
has a chance to read the NCQ log page. Libata EH gets confused by empty
NCQ log page which results in HSM violation. The failed command gets
retried a few times and each time fails with the same HSM violation.
Finally, libata decides to disable NCQ due to subsequent HSM vioaltions.

To avoid unwanted hard resets we can initiate abort all from the driver
to prevent libsas EH from calling lldd_abort_task()/pm8001_abort_task().

Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Signed-off-by: Terrence Adams <tadamsjr@google.com>
---
 drivers/scsi/pm8001/pm8001_hwi.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Niklas Cassel June 11, 2024, 8:40 a.m. UTC | #1
Hello Igor, TJ,

On Fri, Jun 07, 2024 at 05:57:42PM +0000, TJ Adams wrote:
> From: Igor Pylypiv <ipylypiv@google.com>
> 
> v6.2 commit 811be570a9a8 ("scsi: pm8001: Use sas_ata_device_link_abort()

Do not specify kernel version (it is irrelevant), SHA1 is enough.


> to handle NCQ errors") removed duplicate NCQ EH from the pm80xx driver
> and started relying on libata to handle the NCQ errors. The PM8006
> controller has a special EH sequence that was added in v4.15 commit
> 869ddbdcae3b ("scsi: pm80xx: corrected SATA abort handling sequence.").

Do not specify kernel version (it is irrelevant), SHA1 is enough.

Since the code added in 869ddbdcae3b still exists in the pm80xx driver,
I think that you should mention the commits in chronological order.
(Right now you mention the oldest still existing code last, which seems
a bit backwards.)


> The special EH sequence issues a hard reset to a drive before libata EH
> has a chance to read the NCQ log page. Libata EH gets confused by empty
> NCQ log page which results in HSM violation. The failed command gets
> retried a few times and each time fails with the same HSM violation.
> Finally, libata decides to disable NCQ due to subsequent HSM vioaltions.

s/vioaltions/violations/

I'm not an expert in libsas EH, but I think that your commit message fails
to explain why this change actually fixes anything. You do not mention the
relationship between the code that you add pm8001_work_fn() and the
existing code in pm8001_abort_task(), and the order in which the functions
get executed.

Does calling sas_execute_internal_abort_dev() from pm8001_work_fn() ensure
that the libsas EH is never invoked? Or does it cancel the hard reset that
is part of the "special EH sequence" in pm8001_abort_task() ?

Wouldn't it be better if this was fixed in pm8001_abort_task() or similar
instead? It appears that the code you add to pm8001_work_fn() (that has a
very ugly if (pm8006)) is only there to undo or avoid the hard reset that
is done in pm8001_abort_task() (which also has a very ugly if (pm8006)).

Now we have this ugly if (pm8006) in two different functions... which
makes my "this could be solved in a nicer way" detector go off.

If this patch (as is) is really the way to go, then I think there should
be a more detailed reasoning why this change is the most sensible one.


Kind regards,
Niklas
Igor Pylypiv June 14, 2024, 9:41 p.m. UTC | #2
On Tue, Jun 11, 2024 at 10:40:48AM +0200, Niklas Cassel wrote:
> Hello Igor, TJ,
> 
Hi Niklas,

Thank you for the feedback!

> On Fri, Jun 07, 2024 at 05:57:42PM +0000, TJ Adams wrote:
> > From: Igor Pylypiv <ipylypiv@google.com>
> > 
> > v6.2 commit 811be570a9a8 ("scsi: pm8001: Use sas_ata_device_link_abort()
> 
> Do not specify kernel version (it is irrelevant), SHA1 is enough.
> 
Noted.

> 
> > to handle NCQ errors") removed duplicate NCQ EH from the pm80xx driver
> > and started relying on libata to handle the NCQ errors. The PM8006
> > controller has a special EH sequence that was added in v4.15 commit
> > 869ddbdcae3b ("scsi: pm80xx: corrected SATA abort handling sequence.").
> 
> Do not specify kernel version (it is irrelevant), SHA1 is enough.
> 
> Since the code added in 869ddbdcae3b still exists in the pm80xx driver,
> I think that you should mention the commits in chronological order.
> (Right now you mention the oldest still existing code last, which seems
> a bit backwards.)
>
Noted. I wanted to emphasise that newer commit 811be570a9a8 broke the NCQ EH
for pm8006 so I put it first. I should have added a Fixes tag to make it
clear. 

> 
> > The special EH sequence issues a hard reset to a drive before libata EH
> > has a chance to read the NCQ log page. Libata EH gets confused by empty
> > NCQ log page which results in HSM violation. The failed command gets
> > retried a few times and each time fails with the same HSM violation.
> > Finally, libata decides to disable NCQ due to subsequent HSM vioaltions.
> 
> s/vioaltions/violations/
> 
> I'm not an expert in libsas EH, but I think that your commit message fails
> to explain why this change actually fixes anything. You do not mention the
> relationship between the code that you add pm8001_work_fn() and the
> existing code in pm8001_abort_task(), and the order in which the functions
> get executed.
> 
Noted, will update with more details.

> Does calling sas_execute_internal_abort_dev() from pm8001_work_fn() ensure
> that the libsas EH is never invoked? Or does it cancel the hard reset that
> is part of the "special EH sequence" in pm8001_abort_task() ?
> 
It ensures that all I/Os are aborted before libsas EH kicks in. Since all
I/Os are aborted by the controller the pm8001_abort_task() will not be called.
Going to add that to commit message as well.

> Wouldn't it be better if this was fixed in pm8001_abort_task() or similar
> instead? It appears that the code you add to pm8001_work_fn() (that has a
> very ugly if (pm8006)) is only there to undo or avoid the hard reset that
> is done in pm8001_abort_task() (which also has a very ugly if (pm8006)).
>

It would definetely be better to fix this in pm8001_abort_task(), if possible.
One way would be to add a flag that will be set when NCQ error happens (when
IO_XFER_ERROR_ABORTED_NCQ_MODE event is received) and then check that flag
in pm8001_abort_task() to skip hard reset. This approach adds another type
of ugliness to the code and I'm not sure which of these ugly approaches is
less ugly.

> Now we have this ugly if (pm8006) in two different functions... which
> makes my "this could be solved in a nicer way" detector go off.
> 

I would be very happy if we can drop those ugly if (pm8006) checks and have
a generic nice solution.

> If this patch (as is) is really the way to go, then I think there should
> be a more detailed reasoning why this change is the most sensible one.
> 

Let me investigate this issue more to see if there is a way to drop 
the ugly pm8006 checks.

Any ideas/suggestions on how to fix this nicely are very welcomed.

> 
> Kind regards,
> Niklas

Thank you,
Igor
diff mbox series

Patch

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index dec1e2d380f1..f19f76dc6e1c 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -1672,7 +1672,18 @@  void pm8001_work_fn(struct work_struct *work)
 	break;
 	case IO_XFER_ERROR_ABORTED_NCQ_MODE:
 	{
+		struct pm8001_hba_info *pm8001_ha = pw->pm8001_ha;
 		dev = pm8001_dev->sas_device;
+		/*
+		 * pm8001_abort_task() issues a hard reset to a drive
+		 * before libata EH has a chance to read the NCQ log page.
+		 *
+		 * Initiate abort all from the driver to prevent libsas EH
+		 * from calling lldd_abort_task() / pm8001_abort_task().
+		 */
+		if (pm8001_ha->chip_id == chip_8006)
+			sas_execute_internal_abort_dev(dev, 0, NULL);
+
 		sas_ata_device_link_abort(dev, false);
 	}
 	break;