Message ID | 20241115162003.3908-2-Kai.Makisara@kolumbus.fi (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | scsi: st: More reset patches | expand |
> On 15. Nov 2024, at 18.20, Kai Mäkisara <Kai.Makisara@kolumbus.fi> wrote: > > Don't use the was_reset flag set by scsi error handling. It is enough to > recognize device resets based in the UNIT ATTENTION sense data. (The > retry counts are zero and either REQ_OP_DRV_OUT or REC_OP_DRV_IN are > used to prevent retries by midlevel.) > Please hold this patch until the problem below has been solved. The following came into my mind when I looked at John's debugging data he sent to linux-scsi today. I still think that UNIT ATTENTIONs (UAs) reach the high level device without problems. The problem is that the device attached to the target first issuing a SCSI command after reset gets the UA. As long as this is st device, there are no problems. But, if it is the sg device attached to the same target, the tape device misses it. The device->was_reset flag stays set in many (most?) cases forever, unless someone resets it. (Scsi_error resets it only if the associated kthread ends up locking the device door.) Because the flag stays set, the st device can notice it even if the sg device gets the UA. Using a flag set by an other layer is ugly (to put it mildly). Even uglier is that st clears the flag when it has noticed that it is set. And there are cases where the device reset does not originate from the same computer. Does anyone have any suggestions?
> On 15. Nov 2024, at 19.45, Kai Mäkisara (Kolumbus) <kai.makisara@kolumbus.fi> wrote: > ... > > I still think that UNIT ATTENTIONs (UAs) reach the high level device without > problems. The problem is that the device attached to the target first issuing > a SCSI command after reset gets the UA. As long as this is st device, > there are no problems. But, if it is the sg device attached to the same target, > the tape device misses it. > ... > > And there are cases where the device reset does not originate from the > same computer. > > Does anyone have any suggestions? Besides Power on/Reset, the same problem applies to the New Media case. One solution might be the following: the midlevel maintains counters for the Power on/Reset and the Media change UNIT ATTENTIONs. The ULDs can read these counters (using wrappers). If the ULD find for a device that the counter value has changed, then the event corresponding to the counter has occurred. The problem of clearing event flags is avoided, A drawback comes from the counter wrap-arounds. If, e.g., four-bit counters are used and there are 16 Power on/Resets between the checks by the ULD, the event is missed when the counter is used. (The ULDs should also check the sense data they do receive. It is possible/probable that the event is recognized based on the sense data even if the counter check misses it.) This solution is easy to implement. I have a test implementation running and it seems to be working. Other solutions that have come into my mind are much more complicated than the counters. Here are examples: - the UNIT ATTENTIONs would be sent to all ULDs attached to the device when they issue the next SCSI command - the ULDs would have possibility to register a callback for UNIT ATTENTIONs
On 11/15/24 12:45, "Kai Mäkisara (Kolumbus)" wrote: > The device->was_reset flag stays set in many (most?) cases forever, unless > someone resets it. (Scsi_error resets it only if the associated kthread ends > up locking the device door.) Because the flag stays set, the st device can > notice it even if the sg device gets the UA. Yes, this is my observation too. Once was_reset is set, it stays on until a mt load or mt unload command is done. Rewinding or repositioning the tape does not clear it. Perhaps this is what the code in st.c was there to clear was_reset. > Using a flag set by an other layer is ugly (to put it mildly). Even uglier is that > st clears the flag when it has noticed that it is set. > > And there are cases where the device reset does not originate from the > same computer. Yes, this is the original case that caused the problem. A true third party device reset comes from another host, connected to the same SCSI tape lun, through a different IT nexus. /John
On 11/16/24 10:35, Kai Mäkisara (Kolumbus) wrote: > >> On 15. Nov 2024, at 19.45, Kai Mäkisara (Kolumbus) <kai.makisara@kolumbus.fi> wrote: >> > ... >> >> I still think that UNIT ATTENTIONs (UAs) reach the high level device without >> problems. The problem is that the device attached to the target first issuing >> a SCSI command after reset gets the UA. As long as this is st device, >> there are no problems. But, if it is the sg device attached to the same target, >> the tape device misses it. > ... >> >> And there are cases where the device reset does not originate from the >> same computer. >> >> Does anyone have any suggestions? > > Besides Power on/Reset, the same problem applies to the New Media case. It seems to me this would apply for any Unit Attention. A hardware device can set a unit attention when ever acted upon by a third party. It is difficult to test this type of event w/out a multi-initiator bus... some kind of configuration that connects multiple host to the same SCSI target/lun device. Large tape library configurations can do this. I don't have a test bed like this. > One solution might be the following: the midlevel maintains counters for > the Power on/Reset and the Media change UNIT ATTENTIONs. The ULDs > can read these counters (using wrappers). If the ULD find for a device that > the counter value has changed, then the event corresponding to the counter > has occurred. The problem of clearing event flags is avoided, I'm not sure I understand what problem you are trying to solve... are you trying to get the mid layer to report Unit Attentions to all ULDs that are attached? That is, you want the mid layer to report the UA to all ULDs, not just the last/latest ULD? > A drawback comes from the counter wrap-arounds. If, e.g., four-bit counters > are used and there are 16 Power on/Resets between the checks by the ULD, > the event is missed when the counter is used. I don't think that would be a problem. As long as the latest and highest priority UA is reported, older, lower priority UAs can be overwritten. For example, a New Media UA followed by a POR UA - the Power On reset should take precedence. > The ULDs should also check > the sense data they do receive. It is possible/probable that the event is > recognized based on the sense data even if the counter check misses it.) Yes, definitely. ULDs should check the sense data. > This solution is easy to implement. I have a test implementation running and > it seems to be working. I've tested these patches that you've posted here with my locally attached LTO-5 tape drive. Everything seems to fine... You can find the test results at: https://bugzilla.kernel.org/show_bug.cgi?id=219419#c20 Note that I've added my own instrumentation patch on top of your series as it helps with the debug logging. linux(tape_test) > git logl -7 177c3f710de4 (HEAD -> tape_test) scsi: st: instrument the pos_unknown code [ John Meneghini / jmeneghi@redhat.com ] 7373d7d717a0 scsi: st: Restore some drive settings after reset [ John Meneghini / Kai.Makisara@kolumbus.fi ] 25be4d26ef58 scsi: st: Remove use of device->was_reset [ John Meneghini / Kai.Makisara@kolumbus.fi ] c5e0a7687c6b scsi: st: New session only when Unit Attention for new tape [ John Meneghini / Kai.Makisara@kolumbus.fi ] 1ab8b314a34b scsi: st: Add MTIOCGET and MTLOAD to ioctls allowed after device reset [John Meneghini / Kai.Makisara@kolumbus.fi] 0013312311da scsi: st: Don't modify unknown block number in MTIOCGET [ John Meneghini / Kai.Makisara@kolumbus.fi ] 2d5404caa8c7 (tag: v6.12-rc7, branch_v6.12-rc7) Linux 6.12-rc7 [ Linus Torvalds / torvalds@linux-foundation.org ] > > Other solutions that have come into my mind are much more complicated > than the counters. Here are examples: > - the UNIT ATTENTIONs would be sent to all ULDs attached to the device > when they issue the next SCSI command > - the ULDs would have possibility to register a callback for UNIT ATTENTIONs This is actually what the SCSI device is supposed to do. If you have a truly SCSI compliant tape device, designed to support multi-initiator access, unit attentions are supposed to be maintained on an I_T by I_T basis. So a device reset from one I_T will set a UA on all I_Ts, and clearing one I_T will clear only that I_T nexus UA. But I'm sure that some of the older, legacy, tape devices don't do this. Let me know if I can help by testing any further patches. /John
> On 18. Nov 2024, at 23.36, John Meneghini <jmeneghi@redhat.com> wrote: > > > On 11/16/24 10:35, Kai Mäkisara (Kolumbus) wrote: >> ... >> Besides Power on/Reset, the same problem applies to the New Media case. > > It seems to me this would apply for any Unit Attention. A hardware device can set a unit attention when ever acted upon by a third party. It is difficult to test this type of event w/out a multi-initiator bus... some kind of configuration that connects multiple host to the same SCSI target/lun device. Large tape library configurations can do this. I don't have a test bed like this. Yes, this applies to all Unit Attentions. I just mentioned the ones st is interested in. > >> One solution might be the following: the midlevel maintains counters for >> the Power on/Reset and the Media change UNIT ATTENTIONs. The ULDs >> can read these counters (using wrappers). If the ULD find for a device that >> the counter value has changed, then the event corresponding to the counter >> has occurred. The problem of clearing event flags is avoided, > > I'm not sure I understand what problem you are trying to solve... are you trying to get the mid layer to report Unit Attentions to all ULDs that are attached? That is, you want the mid layer to report the UA to all ULDs, not just the last/latest ULD? I would rather say that midlevel provides a method for all ULDs to see if certain Unit Attentions have happened. One gets all the sense data, but others can at least see that something has happened. > >> A drawback comes from the counter wrap-arounds. If, e.g., four-bit counters >> are used and there are 16 Power on/Resets between the checks by the ULD, >> the event is missed when the counter is used. > > I don't think that would be a problem. As long as the latest and highest priority UA is reported, older, lower priority UAs can be overwritten. For example, a New Media UA followed by a POR UA - the Power On reset should take precedence. My text seems to be somewhat ambiguous. I meant that there would be two counters, one for New Media events and one for PORs. St needs to know which one (or both) has occurred. ... >> Other solutions that have come into my mind are much more complicated >> than the counters. Here are examples: >> - the UNIT ATTENTIONs would be sent to all ULDs attached to the device >> when they issue the next SCSI command >> - the ULDs would have possibility to register a callback for UNIT ATTENTIONs > > This is actually what the SCSI device is supposed to do. If you have a truly SCSI compliant tape device, designed to support multi-initiator access, unit attentions are supposed to be maintained on an I_T by I_T basis. So a device reset from one I_T will set a UA on all I_Ts, and clearing one I_T will clear only that I_T nexus UA. But I'm sure that some of the older, legacy, tape devices don't do this. Yes, this is what has motivated the suggestions above. SCSI does send Unit Attentions to all initiators. Linux could extend this so that all ULDs attached to a device get the Unit Attentions. However, one big problem is, where to find someone to implement this?
On 11/19/24 10:09, "Kai Mäkisara (Kolumbus)" wrote: >> I'm not sure I understand what problem you are trying to solve... are you trying to get the mid >> layer to report Unit Attentions to all ULDs that are attached? That is, you want the mid layer >> to report the UA to all ULDs, not just the last/latest ULD? > I would rather say that midlevel provides a method for all ULDs to see if certain > Unit Attentions have happened. One gets all the sense data, but others can > at least see that something has happened. OK. That might work. Be sure to cc me on the patch and I will take a look. /John
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c index e8ef27d7ef61..3acaa3561c81 100644 --- a/drivers/scsi/st.c +++ b/drivers/scsi/st.c @@ -163,9 +163,11 @@ static const char *st_formats[] = { static int debugging = DEBUG; +/* Setting these non-zero may risk recognizing resets */ #define MAX_RETRIES 0 #define MAX_WRITE_RETRIES 0 #define MAX_READY_RETRIES 0 + #define NO_TAPE NOT_READY #define ST_TIMEOUT (900 * HZ) @@ -413,10 +415,11 @@ static int st_chk_result(struct scsi_tape *STp, struct st_request * SRpnt) if (cmdstatp->have_sense && cmdstatp->sense_hdr.asc == 0 && cmdstatp->sense_hdr.ascq == 0x17) STp->cleaning_req = 1; /* ASC and ASCQ => cleaning requested */ - if (cmdstatp->have_sense && scode == UNIT_ATTENTION && cmdstatp->sense_hdr.asc == 0x29) + if (cmdstatp->have_sense && scode == UNIT_ATTENTION && + cmdstatp->sense_hdr.asc == 0x29) { STp->pos_unknown = 1; /* ASC => power on / reset */ - - STp->pos_unknown |= STp->device->was_reset; + st_printk(KERN_WARNING, STp, "Power on/reset recognized."); + } if (cmdstatp->have_sense && scode == RECOVERED_ERROR @@ -3631,9 +3634,7 @@ static long st_ioctl(struct file *file, unsigned int cmd_in, unsigned long arg) retval = (-EIO); goto out; } - reset_state(STp); - /* remove this when the midlevel properly clears was_reset */ - STp->device->was_reset = 0; + reset_state(STp); /* Clears pos_unknown */ } if (mtc.mt_op != MTNOP && mtc.mt_op != MTSETBLK &&
Don't use the was_reset flag set by scsi error handling. It is enough to recognize device resets based in the UNIT ATTENTION sense data. (The retry counts are zero and either REQ_OP_DRV_OUT or REC_OP_DRV_IN are used to prevent retries by midlevel.) Signed-off-by: Kai Mäkisara <Kai.Makisara@kolumbus.fi> --- drivers/scsi/st.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-)