[1/2] scsi: st: Remove use of device->was_reset

Message ID	20241115162003.3908-2-Kai.Makisara@kolumbus.fi (mailing list archive)
State	New
Headers	show Received: from fgw23-4.mail.saunalahti.fi (fgw23-4.mail.saunalahti.fi [62.142.5.110]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 450F71D47B4 for <linux-scsi@vger.kernel.org>; Fri, 15 Nov 2024 16:20:45 +0000 (UTC) From: =?utf-8?q?Kai_M=C3=A4kisara?= <Kai.Makisara@kolumbus.fi> To: linux-scsi@vger.kernel.org, jmeneghi@redhat.com Cc: martin.petersen@oracle.com, James.Bottomley@HansenPartnership.com, loberman@redhat.com, =?utf-8?q?Kai_M=C3=A4kisara?= <Kai.Makisara@kolumbus.fi> Subject: [PATCH 1/2] scsi: st: Remove use of device->was_reset Date: Fri, 15 Nov 2024 18:20:02 +0200 Message-ID: <20241115162003.3908-2-Kai.Makisara@kolumbus.fi> In-Reply-To: <20241115162003.3908-1-Kai.Makisara@kolumbus.fi> References: <20241115162003.3908-1-Kai.Makisara@kolumbus.fi> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	scsi: st: More reset patches \| expand [0/2] scsi: st: More reset patches [1/2] scsi: st: Remove use of device->was_reset [2/2] scsi: st: Restore some drive settings after reset

Kai Mäkisara (Kolumbus) Nov. 15, 2024, 4:20 p.m. UTC

Don't use the was_reset flag set by scsi error handling. It is enough to
recognize device resets based in the UNIT ATTENTION sense data. (The
retry counts are zero and either REQ_OP_DRV_OUT or REC_OP_DRV_IN are
used to prevent retries by midlevel.)

Signed-off-by: Kai Mäkisara <Kai.Makisara@kolumbus.fi>
---
 drivers/scsi/st.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Kai Mäkisara (Kolumbus) Nov. 15, 2024, 5:45 p.m. UTC | #1

> On 15. Nov 2024, at 18.20, Kai Mäkisara <Kai.Makisara@kolumbus.fi> wrote:
> 
> Don't use the was_reset flag set by scsi error handling. It is enough to
> recognize device resets based in the UNIT ATTENTION sense data. (The
> retry counts are zero and either REQ_OP_DRV_OUT or REC_OP_DRV_IN are
> used to prevent retries by midlevel.)
> 

Please hold this patch until the problem below has been solved.

The following came into my mind when I looked at John's debugging data he
sent to linux-scsi today.

I still think that UNIT ATTENTIONs (UAs) reach the high level device without
problems. The problem is that the device attached to the target first issuing
a SCSI command after reset gets the UA. As long as this is st device,
there are no problems. But, if it is the sg device attached to the same target,
the tape device misses it.

The device->was_reset flag stays set in many (most?) cases forever, unless
someone resets it. (Scsi_error resets it only if the associated kthread ends
up locking the device door.) Because the flag stays set, the st device can
notice it even if the sg device gets the UA.

Using a flag set by an other layer is ugly (to put it mildly). Even uglier is that
st clears the flag when it has noticed that it is set.

And there are cases where the device reset does not originate from the
same computer.

Does anyone have any suggestions?

Kai Mäkisara (Kolumbus) Nov. 16, 2024, 3:35 p.m. UTC | #2

> On 15. Nov 2024, at 19.45, Kai Mäkisara (Kolumbus) <kai.makisara@kolumbus.fi> wrote:
> 
...
> 
> I still think that UNIT ATTENTIONs (UAs) reach the high level device without
> problems. The problem is that the device attached to the target first issuing
> a SCSI command after reset gets the UA. As long as this is st device,
> there are no problems. But, if it is the sg device attached to the same target,
> the tape device misses it.
> 
...
> 
> And there are cases where the device reset does not originate from the
> same computer.
> 
> Does anyone have any suggestions?

Besides Power on/Reset, the same problem applies to the New Media case.

One solution might be the following: the midlevel maintains counters for
the Power on/Reset and the Media change UNIT ATTENTIONs. The ULDs
can read these counters (using wrappers). If the ULD find for a device that
the counter value has changed, then the event corresponding to the counter
has occurred. The problem of clearing event flags is avoided,

A drawback comes from the counter wrap-arounds. If, e.g., four-bit counters
are used and there are 16 Power on/Resets between the checks by the ULD,
the event is missed when the counter is used. (The ULDs should also check
the sense data they do receive. It is possible/probable that the event is
recognized based on the sense data even if the counter check misses it.)

This solution is easy to implement. I have a test implementation running and
it seems to be working.

Other solutions that have come into my mind are much more complicated 
than the counters. Here are examples:
- the UNIT ATTENTIONs would be sent to all ULDs attached to the device
  when they issue the next SCSI command
- the ULDs would have possibility to register a callback for UNIT ATTENTIONs

John Meneghini Nov. 18, 2024, 8:45 p.m. UTC | #3

On 11/15/24 12:45, "Kai Mäkisara (Kolumbus)" wrote:
> The device->was_reset flag stays set in many (most?) cases forever, unless
> someone resets it. (Scsi_error resets it only if the associated kthread ends
> up locking the device door.) Because the flag stays set, the st device can
> notice it even if the sg device gets the UA.

Yes, this is my observation too. Once was_reset is set, it stays on until a mt load or mt unload command is done.  Rewinding or 
repositioning the tape does not clear it.  Perhaps this is what the code in st.c was there to clear was_reset.

> Using a flag set by an other layer is ugly (to put it mildly). Even uglier is that
> st clears the flag when it has noticed that it is set.
> 
> And there are cases where the device reset does not originate from the
> same computer.

Yes, this is the original case that caused the problem.  A true third party device reset comes from another host, connected to 
the same SCSI tape lun, through a different IT nexus.

/John

John Meneghini Nov. 18, 2024, 9:36 p.m. UTC | #4

On 11/16/24 10:35, Kai Mäkisara (Kolumbus) wrote:
>  
>> On 15. Nov 2024, at 19.45, Kai Mäkisara (Kolumbus) <kai.makisara@kolumbus.fi> wrote:
>>
> ...
>>
>> I still think that UNIT ATTENTIONs (UAs) reach the high level device without
>> problems. The problem is that the device attached to the target first issuing
>> a SCSI command after reset gets the UA. As long as this is st device,
>> there are no problems. But, if it is the sg device attached to the same target,
>> the tape device misses it. > ...
>>
>> And there are cases where the device reset does not originate from the
>> same computer.
>>
>> Does anyone have any suggestions?
> 
> Besides Power on/Reset, the same problem applies to the New Media case.

It seems to me this would apply for any Unit Attention. A hardware device can set a unit attention when ever acted upon by a 
third party.  It is difficult to test this type of event w/out a multi-initiator bus... some kind of configuration that connects 
multiple host to the same SCSI target/lun device.  Large tape library configurations can do this. I don't have a test bed like 
this.

> One solution might be the following: the midlevel maintains counters for
> the Power on/Reset and the Media change UNIT ATTENTIONs. The ULDs
> can read these counters (using wrappers). If the ULD find for a device that
> the counter value has changed, then the event corresponding to the counter
> has occurred. The problem of clearing event flags is avoided,

I'm not sure I understand what problem you are trying to solve... are you trying to get the mid layer to report Unit Attentions 
to all ULDs that are attached?  That is, you want the mid layer to report the UA to all ULDs, not just the last/latest ULD?

> A drawback comes from the counter wrap-arounds. If, e.g., four-bit counters
> are used and there are 16 Power on/Resets between the checks by the ULD,
> the event is missed when the counter is used. 

I don't think that would be a problem. As long as the latest and highest priority UA is reported, older, lower priority UAs can 
be overwritten.  For example, a New Media UA followed by a POR UA - the Power On reset should take precedence.

> The ULDs should also check
> the sense data they do receive. It is possible/probable that the event is
> recognized based on the sense data even if the counter check misses it.)

Yes, definitely. ULDs should check the sense data.

> This solution is easy to implement. I have a test implementation running and
> it seems to be working.

I've tested these patches that you've posted here with my locally attached LTO-5 tape drive.  Everything seems to fine...

You can find the test results at:

https://bugzilla.kernel.org/show_bug.cgi?id=219419#c20

Note that I've added my own instrumentation patch on top of your series as it helps with the debug logging.

linux(tape_test) > git logl -7
177c3f710de4 (HEAD -> tape_test) scsi: st: instrument the pos_unknown code [ John Meneghini / jmeneghi@redhat.com ]
7373d7d717a0 scsi: st: Restore some drive settings after reset [ John Meneghini / Kai.Makisara@kolumbus.fi ]
25be4d26ef58 scsi: st: Remove use of device->was_reset [ John Meneghini / Kai.Makisara@kolumbus.fi ]
c5e0a7687c6b scsi: st: New session only when Unit Attention for new tape [ John Meneghini / Kai.Makisara@kolumbus.fi ]
1ab8b314a34b scsi: st: Add MTIOCGET and MTLOAD to ioctls allowed after device reset [John Meneghini / Kai.Makisara@kolumbus.fi]
0013312311da scsi: st: Don't modify unknown block number in MTIOCGET [ John Meneghini / Kai.Makisara@kolumbus.fi ]
2d5404caa8c7 (tag: v6.12-rc7, branch_v6.12-rc7) Linux 6.12-rc7 [ Linus Torvalds / torvalds@linux-foundation.org ]

> 
> Other solutions that have come into my mind are much more complicated
> than the counters. Here are examples:
> - the UNIT ATTENTIONs would be sent to all ULDs attached to the device
>    when they issue the next SCSI command
> - the ULDs would have possibility to register a callback for UNIT ATTENTIONs

This is actually what the SCSI device is supposed to do.  If you have a truly SCSI compliant tape device, designed to support 
multi-initiator access, unit attentions are supposed to be maintained on an I_T by I_T basis. So a device reset from one I_T 
will set a UA on all I_Ts, and clearing one I_T will clear only that I_T nexus UA.  But I'm sure that some of the older, legacy, 
tape devices don't do this.

Let me know if I can help by testing any further patches.

/John

Kai Mäkisara (Kolumbus) Nov. 19, 2024, 3:09 p.m. UTC | #5

> On 18. Nov 2024, at 23.36, John Meneghini <jmeneghi@redhat.com> wrote:
> 
> 
> On 11/16/24 10:35, Kai Mäkisara (Kolumbus) wrote:
>> 

...

>> Besides Power on/Reset, the same problem applies to the New Media case.
> 
> It seems to me this would apply for any Unit Attention. A hardware device can set a unit attention when ever acted upon by a third party.  It is difficult to test this type of event w/out a multi-initiator bus... some kind of configuration that connects multiple host to the same SCSI target/lun device.  Large tape library configurations can do this. I don't have a test bed like this.

Yes, this applies to all Unit Attentions. I just mentioned the ones st is interested in.

> 
>> One solution might be the following: the midlevel maintains counters for
>> the Power on/Reset and the Media change UNIT ATTENTIONs. The ULDs
>> can read these counters (using wrappers). If the ULD find for a device that
>> the counter value has changed, then the event corresponding to the counter
>> has occurred. The problem of clearing event flags is avoided,
> 
> I'm not sure I understand what problem you are trying to solve... are you trying to get the mid layer to report Unit Attentions to all ULDs that are attached?  That is, you want the mid layer to report the UA to all ULDs, not just the last/latest ULD?

I would rather say that midlevel provides a method for all ULDs to see if certain
Unit Attentions have happened. One gets all the sense data, but others can
at least see that something has happened.

> 
>> A drawback comes from the counter wrap-arounds. If, e.g., four-bit counters
>> are used and there are 16 Power on/Resets between the checks by the ULD,
>> the event is missed when the counter is used. 
> 
> I don't think that would be a problem. As long as the latest and highest priority UA is reported, older, lower priority UAs can be overwritten.  For example, a New Media UA followed by a POR UA - the Power On reset should take precedence.

My text seems to be somewhat ambiguous. I meant that there would be two
counters, one for New Media events and one for PORs. St needs to know
which one (or both) has occurred.

...

>> Other solutions that have come into my mind are much more complicated
>> than the counters. Here are examples:
>> - the UNIT ATTENTIONs would be sent to all ULDs attached to the device
>>   when they issue the next SCSI command
>> - the ULDs would have possibility to register a callback for UNIT ATTENTIONs
> 
> This is actually what the SCSI device is supposed to do.  If you have a truly SCSI compliant tape device, designed to support multi-initiator access, unit attentions are supposed to be maintained on an I_T by I_T basis. So a device reset from one I_T will set a UA on all I_Ts, and clearing one I_T will clear only that I_T nexus UA.  But I'm sure that some of the older, legacy, tape devices don't do this.

Yes, this is what has motivated the suggestions above. SCSI does send Unit
Attentions to all initiators. Linux could extend this so that all ULDs attached to
a device get the Unit Attentions. However, one big problem is, where to find
someone to implement this?

John Meneghini Nov. 19, 2024, 5:42 p.m. UTC | #6

On 11/19/24 10:09, "Kai Mäkisara (Kolumbus)" wrote:
>> I'm not sure I understand what problem you are trying to solve... are you trying to get the mid 
>> layer to report Unit Attentions to all ULDs that are attached?  That is, you want the mid layer 
>> to report the UA to all ULDs, not just the last/latest ULD?

> I would rather say that midlevel provides a method for all ULDs to see if certain
> Unit Attentions have happened. One gets all the sense data, but others can
> at least see that something has happened.

OK. That might work.  Be sure to cc me on the patch and I will take a look.

/John

[1/2] scsi: st: Remove use of device->was_reset

Commit Message

Comments

Patch