diff mbox

scsi: don't count non-failure CHECK_CONDITION as error

Message ID 20160114214602.GC3520@mtj.duckdns.org (mailing list archive)
State Rejected, archived
Headers show

Commit Message

Tejun Heo Jan. 14, 2016, 9:46 p.m. UTC
SCSI command completion path bumps ioerr_cnt whenever scsi_cmd->result
isn't zero; unfortunately, this means that non-error sense reporting
bumps the counter too.  This is pronounced with ATA passthrough
commands because most of them explicitly request the resulting
taskfile to be transported via sense data bumping the count
unconditionally.

Don't bump the counter if scsi_cmd->result simply indicates that sense
data is available.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <dsj@fb.com>
---
 drivers/scsi/scsi_lib.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Hannes Reinecke Jan. 15, 2016, 10:04 a.m. UTC | #1
On 01/14/2016 10:46 PM, Tejun Heo wrote:
> SCSI command completion path bumps ioerr_cnt whenever scsi_cmd->result
> isn't zero; unfortunately, this means that non-error sense reporting
> bumps the counter too.  This is pronounced with ATA passthrough
> commands because most of them explicitly request the resulting
> taskfile to be transported via sense data bumping the count
> unconditionally.
>
> Don't bump the counter if scsi_cmd->result simply indicates that sense
> data is available.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Dave Jones <dsj@fb.com>
> ---
>   drivers/scsi/scsi_lib.c |    3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index fa6b2c4..e90e3f7 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1622,7 +1622,8 @@ static void scsi_softirq_done(struct request *rq)
>   	INIT_LIST_HEAD(&cmd->eh_entry);
>
>   	atomic_inc(&cmd->device->iodone_cnt);
> -	if (cmd->result)
> +	if (cmd->result &&
> +	    cmd->result != ((DRIVER_SENSE << 24) | SAM_STAT_CHECK_CONDITION))
>   		atomic_inc(&cmd->device->ioerr_cnt);
>
>   	disposition = scsi_decide_disposition(cmd);

Reviewed-by: Hannes Reinecke <hare@suse.com>

Hannes
James Bottomley Jan. 15, 2016, 3:46 p.m. UTC | #2
On Thu, 2016-01-14 at 16:46 -0500, Tejun Heo wrote:
> SCSI command completion path bumps ioerr_cnt whenever scsi_cmd
> ->result isn't zero; unfortunately, this means that non-error sense 
> reporting bumps the counter too.  This is pronounced with ATA 
> passthrough commands because most of them explicitly request the 
> resulting taskfile to be transported via sense data bumping the count
> unconditionally.
> 
> Don't bump the counter if scsi_cmd->result simply indicates that 
> sense data is available.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Dave Jones <dsj@fb.com>
> ---
>  drivers/scsi/scsi_lib.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index fa6b2c4..e90e3f7 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1622,7 +1622,8 @@ static void scsi_softirq_done(struct request
> *rq)
>  	INIT_LIST_HEAD(&cmd->eh_entry);
>  
>  	atomic_inc(&cmd->device->iodone_cnt);
> -	if (cmd->result)
> +	if (cmd->result &&
> +	    cmd->result != ((DRIVER_SENSE << 24) |
> SAM_STAT_CHECK_CONDITION))
>  		atomic_inc(&cmd->device->ioerr_cnt);

OK, it makes sense to me that we don't include non-error check
conditions.  However, then you shouldn't be checking DRIVER_SENSE.  We
still have a few drivers that rely on the error handler to fetch sense
explicitly ... they could eventually return non-error conditions as
well.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley Jan. 15, 2016, 3:55 p.m. UTC | #3
On Fri, 2016-01-15 at 07:46 -0800, James Bottomley wrote:
> On Thu, 2016-01-14 at 16:46 -0500, Tejun Heo wrote:
> > SCSI command completion path bumps ioerr_cnt whenever scsi_cmd
> > ->result isn't zero; unfortunately, this means that non-error sense
> > reporting bumps the counter too.  This is pronounced with ATA 
> > passthrough commands because most of them explicitly request the 
> > resulting taskfile to be transported via sense data bumping the
> > count
> > unconditionally.
> > 
> > Don't bump the counter if scsi_cmd->result simply indicates that 
> > sense data is available.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reported-by: Dave Jones <dsj@fb.com>
> > ---
> >  drivers/scsi/scsi_lib.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index fa6b2c4..e90e3f7 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1622,7 +1622,8 @@ static void scsi_softirq_done(struct request
> > *rq)
> >  	INIT_LIST_HEAD(&cmd->eh_entry);
> >  
> >  	atomic_inc(&cmd->device->iodone_cnt);
> > -	if (cmd->result)
> > +	if (cmd->result &&
> > +	    cmd->result != ((DRIVER_SENSE << 24) |
> > SAM_STAT_CHECK_CONDITION))
> >  		atomic_inc(&cmd->device->ioerr_cnt);
> 
> OK, it makes sense to me that we don't include non-error check
> conditions.  However, then you shouldn't be checking DRIVER_SENSE. 
>  We still have a few drivers that rely on the error handler to fetch
> sense explicitly ... they could eventually return non-error 
> conditions as well.

Actually, I take this back: if we add your proposal, we never increment
the ioerr_cnt even for sense returns indicating failure.  That looks to
be even worse than incrementing it too often.

The other problem is that if we do this for you, we should do the same
for SCSI with BUSY and QUEUE_FULL ... they indicate temporary retry
conditions and shouldn't be treated as errors.

I'll stop looking now before I find any more problems with the
statistics code ...  I think it needs a rethink.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Douglas Gilbert Jan. 15, 2016, 4:50 p.m. UTC | #4
On 16-01-15 04:55 PM, James Bottomley wrote:
> On Fri, 2016-01-15 at 07:46 -0800, James Bottomley wrote:
>> On Thu, 2016-01-14 at 16:46 -0500, Tejun Heo wrote:
>>> SCSI command completion path bumps ioerr_cnt whenever scsi_cmd
>>> ->result isn't zero; unfortunately, this means that non-error sense
>>> reporting bumps the counter too.  This is pronounced with ATA
>>> passthrough commands because most of them explicitly request the
>>> resulting taskfile to be transported via sense data bumping the
>>> count
>>> unconditionally.
>>>
>>> Don't bump the counter if scsi_cmd->result simply indicates that
>>> sense data is available.
>>>
>>> Signed-off-by: Tejun Heo <tj@kernel.org>
>>> Reported-by: Dave Jones <dsj@fb.com>
>>> ---
>>>   drivers/scsi/scsi_lib.c |    3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index fa6b2c4..e90e3f7 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -1622,7 +1622,8 @@ static void scsi_softirq_done(struct request
>>> *rq)
>>>   	INIT_LIST_HEAD(&cmd->eh_entry);
>>>
>>>   	atomic_inc(&cmd->device->iodone_cnt);
>>> -	if (cmd->result)
>>> +	if (cmd->result &&
>>> +	    cmd->result != ((DRIVER_SENSE << 24) |
>>> SAM_STAT_CHECK_CONDITION))
>>>   		atomic_inc(&cmd->device->ioerr_cnt);
>>
>> OK, it makes sense to me that we don't include non-error check
>> conditions.  However, then you shouldn't be checking DRIVER_SENSE.
>>   We still have a few drivers that rely on the error handler to fetch
>> sense explicitly ... they could eventually return non-error
>> conditions as well.
>
> Actually, I take this back: if we add your proposal, we never increment
> the ioerr_cnt even for sense returns indicating failure.  That looks to
> be even worse than incrementing it too often.
>
> The other problem is that if we do this for you, we should do the same
> for SCSI with BUSY and QUEUE_FULL ... they indicate temporary retry
> conditions and shouldn't be treated as errors.
>
> I'll stop looking now before I find any more problems with the
> statistics code ...  I think it needs a rethink.

SCSI status and sense data is non-trivial to decode. It looks
like someone thought that one-liner would bypass a lot of hard
work. Most of the time sense data indicates an error but not
always. Even worse, it can contain vendor specific codes. If
this statistic is on the fast path then IMO it should be retired
(and any others like it). For backward compatibility set it to
0 once at initialization and document the change. Or you could
have a discouraging kernel config option such as
CONFIG_EXPENSIVE_SCSI_STATISTICS ("nonsensical" is another term
that comes to mind).

Doug Gilbert


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley Jan. 15, 2016, 6:35 p.m. UTC | #5
On Fri, 2016-01-15 at 17:50 +0100, Douglas Gilbert wrote:
> On 16-01-15 04:55 PM, James Bottomley wrote:
> > On Fri, 2016-01-15 at 07:46 -0800, James Bottomley wrote:
> > > On Thu, 2016-01-14 at 16:46 -0500, Tejun Heo wrote:
> > > > SCSI command completion path bumps ioerr_cnt whenever scsi_cmd
> > > > ->result isn't zero; unfortunately, this means that non-error
> > > > sense
> > > > reporting bumps the counter too.  This is pronounced with ATA
> > > > passthrough commands because most of them explicitly request
> > > > the
> > > > resulting taskfile to be transported via sense data bumping the
> > > > count
> > > > unconditionally.
> > > > 
> > > > Don't bump the counter if scsi_cmd->result simply indicates
> > > > that
> > > > sense data is available.
> > > > 
> > > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > > Reported-by: Dave Jones <dsj@fb.com>
> > > > ---
> > > >   drivers/scsi/scsi_lib.c |    3 ++-
> > > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index fa6b2c4..e90e3f7 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -1622,7 +1622,8 @@ static void scsi_softirq_done(struct
> > > > request
> > > > *rq)
> > > >   	INIT_LIST_HEAD(&cmd->eh_entry);
> > > > 
> > > >   	atomic_inc(&cmd->device->iodone_cnt);
> > > > -	if (cmd->result)
> > > > +	if (cmd->result &&
> > > > +	    cmd->result != ((DRIVER_SENSE << 24) |
> > > > SAM_STAT_CHECK_CONDITION))
> > > >   		atomic_inc(&cmd->device->ioerr_cnt);
> > > 
> > > OK, it makes sense to me that we don't include non-error check
> > > conditions.  However, then you shouldn't be checking
> > > DRIVER_SENSE.
> > >   We still have a few drivers that rely on the error handler to
> > > fetch
> > > sense explicitly ... they could eventually return non-error
> > > conditions as well.
> > 
> > Actually, I take this back: if we add your proposal, we never
> > increment
> > the ioerr_cnt even for sense returns indicating failure.  That
> > looks to
> > be even worse than incrementing it too often.
> > 
> > The other problem is that if we do this for you, we should do the
> > same
> > for SCSI with BUSY and QUEUE_FULL ... they indicate temporary retry
> > conditions and shouldn't be treated as errors.
> > 
> > I'll stop looking now before I find any more problems with the
> > statistics code ...  I think it needs a rethink.
> 
> SCSI status and sense data is non-trivial to decode. It looks
> like someone thought that one-liner would bypass a lot of hard
> work. Most of the time sense data indicates an error but not
> always. Even worse, it can contain vendor specific codes. If
> this statistic is on the fast path then IMO it should be retired
> (and any others like it). For backward compatibility set it to
> 0 once at initialization and document the change. Or you could
> have a discouraging kernel config option such as
> CONFIG_EXPENSIVE_SCSI_STATISTICS ("nonsensical" is another term
> that comes to mind).

Well, I can see sense in having an error count of everything that comes
back that's not good status because it's easy and has a well defined
meaning (calling it the "error count" is more debatable, agreed).  It
appears that Dave and Tejun want the count to mean something else. 
 Lets treat this as a feature exercise: Dave and Tejun, what do you
want, then we can see if we could add an additional counter giving you
that.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Jan. 15, 2016, 6:42 p.m. UTC | #6
Hello, James.

On Fri, Jan 15, 2016 at 10:35:34AM -0800, James Bottomley wrote:
> Well, I can see sense in having an error count of everything that comes
> back that's not good status because it's easy and has a well defined
> meaning (calling it the "error count" is more debatable, agreed).  It
> appears that Dave and Tejun want the count to mean something else. 
>  Lets treat this as a feature exercise: Dave and Tejun, what do you
> want, then we can see if we could add an additional counter giving you
> that.

Well, currently, for libata devices, all passthrough commands bump it
up because the result taskfile is reported via sense data, which is
pretty misleading for something named ioerr_cnt.  Maybe we can ignore
CHECK_CONDITION for ATA passthrough commands but special casing them
can be confusing in other ways.

Thanks.
James Bottomley Jan. 15, 2016, 7:09 p.m. UTC | #7
On Fri, 2016-01-15 at 13:42 -0500, Tejun Heo wrote:
> Hello, James.
> 
> On Fri, Jan 15, 2016 at 10:35:34AM -0800, James Bottomley wrote:
> > Well, I can see sense in having an error count of everything that
> > comes
> > back that's not good status because it's easy and has a well
> > defined
> > meaning (calling it the "error count" is more debatable, agreed). 
> >  It
> > appears that Dave and Tejun want the count to mean something else. 
> >  Lets treat this as a feature exercise: Dave and Tejun, what do you
> > want, then we can see if we could add an additional counter giving
> > you
> > that.
> 
> Well, currently, for libata devices, all passthrough commands bump it
> up because the result taskfile is reported via sense data, which is
> pretty misleading for something named ioerr_cnt.  Maybe we can ignore
> CHECK_CONDITION for ATA passthrough commands but special casing them
> can be confusing in other ways.

That doesn't really help me pin down what you want.  I already said
ioerr_cnt is misleading, but we're stuck with the name, becuase it's an
ABI.

Under the definition I gave, the behaviour you currently see is
correct: all commands with non good status count as errors.  That
includes a ton of stuff I wouldn't classify as real errors, like queue
full status, deferred sense codes and even auto correct, which is why I
believe the term "error count" is misleading, but, as I said, we're
stuck with it.

If we want to change what is being counted, we have to change the
definition, so what is the definition you want to see for counting
errors? and what's the reason driving this change?

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Jan. 15, 2016, 7:27 p.m. UTC | #8
Hello, James.

On Fri, Jan 15, 2016 at 11:09:16AM -0800, James Bottomley wrote:
> If we want to change what is being counted, we have to change the
> definition, so what is the definition you want to see for counting
> errors? and what's the reason driving this change?

There isn't anything else to it than what I already wrote.  It goes up
for things which aren't failures making the counter essentially
useless.  It's confusing from userland as it's unintuitive that these
commands are wrapped in SCSI ATA passthrough commands which use sense
data for result reporting and doing equivalent operations on native
SCSI and ATA devices lead to different outcomes.  The counter as-is is
just useless for libata devices.  If it can be rectified easily,
great.  If not, it isn't anything critical.

Thanks.
James Bottomley Jan. 15, 2016, 7:36 p.m. UTC | #9
On Fri, 2016-01-15 at 14:27 -0500, Tejun Heo wrote:
> Hello, James.
> 
> On Fri, Jan 15, 2016 at 11:09:16AM -0800, James Bottomley wrote:
> > If we want to change what is being counted, we have to change the
> > definition, so what is the definition you want to see for counting
> > errors? and what's the reason driving this change?
> 
> There isn't anything else to it than what I already wrote.  It goes
> up
> for things which aren't failures making the counter essentially
> useless.  It's confusing from userland as it's unintuitive that these
> commands are wrapped in SCSI ATA passthrough commands which use sense
> data for result reporting and doing equivalent operations on native
> SCSI and ATA devices lead to different outcomes.  The counter as-is
> is
> just useless for libata devices.  If it can be rectified easily,
> great.  If not, it isn't anything critical.

That's the point we were making: it can't be rectified easily.  Simply
counting non-good returns is easy.  Counting something that's closer to
errors is hard because first we have to define what an "error" is (I
could see queue full being not an error, but what about not ready or
initializing command required?) and then implement and infrastructure
to classify and count the returns, which means an awful lot of sense
data parsing.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Jan. 15, 2016, 7:40 p.m. UTC | #10
On Fri, Jan 15, 2016 at 11:36:08AM -0800, James Bottomley wrote:
> That's the point we were making: it can't be rectified easily.  Simply
> counting non-good returns is easy.  Counting something that's closer to
> errors is hard because first we have to define what an "error" is (I
> could see queue full being not an error, but what about not ready or
> initializing command required?) and then implement and infrastructure
> to classify and count the returns, which means an awful lot of sense
> data parsing.

Yeap, no objection.  Leaving as-is fine.
diff mbox

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index fa6b2c4..e90e3f7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1622,7 +1622,8 @@  static void scsi_softirq_done(struct request *rq)
 	INIT_LIST_HEAD(&cmd->eh_entry);
 
 	atomic_inc(&cmd->device->iodone_cnt);
-	if (cmd->result)
+	if (cmd->result &&
+	    cmd->result != ((DRIVER_SENSE << 24) | SAM_STAT_CHECK_CONDITION))
 		atomic_inc(&cmd->device->ioerr_cnt);
 
 	disposition = scsi_decide_disposition(cmd);