diff mbox series

[GIT,PULL] SCSI fixes for 6.10-rc4

Message ID 317d2de5fdb5e27f8f493e0e0ad23640a41b6acf.camel@HansenPartnership.com (mailing list archive)
State Not Applicable
Headers show
Series [GIT,PULL] SCSI fixes for 6.10-rc4 | expand

Commit Message

James Bottomley June 21, 2024, 9:15 p.m. UTC
Two fixes: one in the ufs driver fixing an obvious memory leak and the
other (with a core flag based update) trying to prevent USB crashes by
stopping the core from issuing a request for the I/O Hints mode page.

The patch is available here:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes

The short changelog is:

Bart Van Assche (2):
      scsi: usb: uas: Do not query the IO Advice Hints Grouping mode page for USB/UAS devices
      scsi: core: Introduce the BLIST_SKIP_IO_HINTS flag

Joel Slebodnick (1):
      scsi: ufs: core: Free memory allocated for model before reinit

And the diffstat:

 drivers/scsi/sd.c              | 4 ++++
 drivers/ufs/core/ufshcd.c      | 1 +
 drivers/usb/storage/scsiglue.c | 6 ++++++
 drivers/usb/storage/uas.c      | 7 +++++++
 include/scsi/scsi_devinfo.h    | 4 +++-
 5 files changed, 21 insertions(+), 1 deletion(-)

With full diff below.

James

---

Comments

Linus Torvalds June 21, 2024, 10:03 p.m. UTC | #1
On Fri, 21 Jun 2024 at 14:16, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> Two fixes: one in the ufs driver fixing an obvious memory leak and the
> other (with a core flag based update) trying to prevent USB crashes by
> stopping the core from issuing a request for the I/O Hints mode page.

Honestly, this mode page issue seems to happen every single time
somebody adds a new query.

Can we place just make the rule be that new mode pages are opt-in, and
*NOT* this kind of broken "opt-out several months later when we
noticed that it inevitably caused problems"?

Because if it isn't some mode page that we have already used for
years, it clearly isn't *that* important.

I'd like to note that the wikipedia page for SCSI mode pages doesn't
even mention 0a/05, and I had to go to some SCSI command document on
seagate.com to find it.

The only other hits on the first page of google? Linux kernel discussions.

That should give people a big heads up that "maybe this thing isn't
very common or commonly known about"?

Which in turn should be a big damn hint to not query it by default.

I've pulled this, but let's aim for this being the LAST TIME we add
some idiotic query for a magical mode page that is mentioned on page
413 in an obscure document, and that didn't exist ten years ago.

Because at this point, blaming "some USB devices" for not reacting
well to it is kind of silly. This isn't our first rodeo. You should
have known about this.

Maybe add a BIG COMMENT in sd_revalidate_disk() to not read random
code pages willy-nilly.

Not that people read comments, but maybe it will remind somebody that
we've done this before, and it never works.

I mean, we literally have this comment for just checking the read-only bit:

                 * First attempt: ask for all pages (0x3F), but only 4 bytes.
                 * We have to start carefully: some devices hang if we ask
                 * for more than is available.

and that's a mode sense command that is at least mentioned on the
wikipedia page.

(And no, I don't consider wikipedia to be somehow special or
authoritative - but it's at least a sign of "does anybody know about
this thing?")

             Linus
pr-tracker-bot@kernel.org June 21, 2024, 10:04 p.m. UTC | #2
The pull request you sent on Fri, 21 Jun 2024 17:15:57 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/35bb670d65fc0f80c62383ab4f2544cec85ac57a

Thank you!
Martin K. Petersen June 22, 2024, 1:48 a.m. UTC | #3
Linus,

> Can we place just make the rule be that new mode pages are opt-in, and
> *NOT* this kind of broken "opt-out several months later when we
> noticed that it inevitably caused problems"?

The specific problem with mode pages is that there is no way to know
whether a given page is supported without asking for it. Whereas for
most of the other things we query at discovery time, the device provides
a list of supported pages we can consult before we attempt to query the
page itself.

> Because if it isn't some mode page that we have already used for
> years, it clearly isn't *that* important.
[...]
> That should give people a big heads up that "maybe this thing isn't
> very common or commonly known about"?

It is a new feature in SCSI spearheaded by the Android folks. That's why
there isn't a lot of information available about it elsewhere.

I am super picky about having good heuristics for when we should attempt
to query a device for new protocol capabilities. In this case we lacked
a reliable indicator that the feature was supported. And since there are
non-UFS devices being implemented which support it too, restricting the
mode page query to Android/UFS devices only did not seem appropriate.

Bart: How about wrapping access to that mode page in GROUP_SUP
and RSCS? It would be nice if we could key off a VPD or two before
attempting the MODE SENSE...
Linus Torvalds June 22, 2024, 1:56 a.m. UTC | #4
On Fri, 21 Jun 2024 at 18:48, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> The specific problem with mode pages is that there is no way to know
> whether a given page is supported without asking for it. Whereas for
> most of the other things we query at discovery time, the device provides
> a list of supported pages we can consult before we attempt to query the
> page itself.

Yes. I know.

But I also know that pretty much *EVERY* time the SCSI layer has
decided to start looking at some new piece of data, it turns out that
"Oh, look, all those devices have only ever been tested with operating
systems that did *NOT* look at that mode page or other thing, and
surprise surprise - not being tested means that it's buggy".

> It is a new feature in SCSI spearheaded by the Android folks. That's why
> there isn't a lot of information available about it elsewhere.

So no wonder random devices are buggy.

And I'm not putting down random devices. Quite the opposite. I'm
stating a well-known fact: untested things are buggy.

And no amount of "but but but it worked for me" is at all an argument.
If it hasn't been tested, it's almost certainly broken somewhere.

We've seen this over and over again.

> I am super picky about having good heuristics for when we should attempt
> to query a device for new protocol capabilities. In this case we lacked
> a reliable indicator that the feature was supported.

My argument is that things should be opt-in.

If it wasn't needed for the previous 30 years go SCSI history, it sure
as heck didn't suddenly become necessary today.

So you literally NEVER DO THIS unless the system admin has explicitly
enabled it.

That's what opt-in means.

And honestly, then the Android people can decide to opt in. Not random
other victims.

What's the advantage of just enabling random new features that have no
real use case today?

Put another way: why wasn't this an explicit opt-in from the get-go?
And why can't we make that be the rule going forward for the *NEXT*
time somebody introduces some random new mode page?

That was my ask.

           Linus
Bart Van Assche June 22, 2024, 4:24 p.m. UTC | #5
On 6/21/24 6:56 PM, Linus Torvalds wrote:
> But I also know that pretty much *EVERY* time the SCSI layer has
> decided to start looking at some new piece of data, it turns out that
> "Oh, look, all those devices have only ever been tested with operating
> systems that did *NOT* look at that mode page or other thing, and
> surprise surprise - not being tested means that it's buggy".

We got the message and we will do what we can to prevent future
regressions for USB devices.

As has been mentioned earlier, there is evidence in
sd_read_write_protect_flag() that SCSI devices may misbehave when
querying a mode page. However, I was not familiar with that code and
hence was not aware of the comments in that code. According to the git
history, these comments were added before 2005, that is before I started
reading the linux-scsi mailing list.

> My argument is that things should be opt-in.
> 
> If it wasn't needed for the previous 30 years go SCSI history, it sure
> as heck didn't suddenly become necessary today.
> 
> So you literally NEVER DO THIS unless the system admin has explicitly
> enabled it.
> 
> That's what opt-in means.
> 
> And honestly, then the Android people can decide to opt in. Not random
> other victims.
 >> What's the advantage of just enabling random new features that have no
> real use case today?
> 
> Put another way: why wasn't this an explicit opt-in from the get-go?
> And why can't we make that be the rule going forward for the *NEXT*
> time somebody introduces some random new mode page?

The new mode page has been introduced last year in SBC-5. UFS devices 
have a mix of SLC and TLC NAND internally and the new mode page allows
device vendors to reduce write amplification. This is important to UFS
device vendors.

I think that the new mode page is useful for all storage devices that
have a mix of slow and fast storage internally and hence that it is also
useful for some enterprise storage devices. This is why the new mode
page is read by default. But as has been mentioned above, we have
learned our lesson and will be much more careful in the future when
adding code that modifies the access pattern of the sd driver for USB
storage devices.

Thanks,

Bart.
Martin K. Petersen June 22, 2024, 4:35 p.m. UTC | #6
Hi Linus!

> But I also know that pretty much *EVERY* time the SCSI layer has
> decided to start looking at some new piece of data, it turns out that
> "Oh, look, all those devices have only ever been tested with operating
> systems that did *NOT* look at that mode page or other thing, and
> surprise surprise - not being tested means that it's buggy".

We have been working towards poking the same things in the same order as
a well-known desktop operating system. Explicitly to leverage USB device
manufacturer testing.

> Put another way: why wasn't this an explicit opt-in from the get-go?

Because the expectation is that it will be widely implemented and used
on pretty much anything that speaks SCSI except USB-attached gadgets.
Hence the ask to disable it for the USB transports instead of
complicating things for every other use case.

It is always unfortunate when a change causes regressions. I agree that,
given that this involved a mode page, there should have been extra
safeguards in place. I should have caught that.
diff mbox series

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fbc11046bbf6..fe82baa924f8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -63,6 +63,7 @@ 
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_dbg.h>
 #include <scsi/scsi_device.h>
+#include <scsi/scsi_devinfo.h>
 #include <scsi/scsi_driver.h>
 #include <scsi/scsi_eh.h>
 #include <scsi/scsi_host.h>
@@ -3118,6 +3119,9 @@  static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
 	struct scsi_mode_data data;
 	int res;
 
+	if (sdp->sdev_bflags & BLIST_SKIP_IO_HINTS)
+		return;
+
 	res = scsi_mode_sense(sdp, /*dbd=*/0x8, /*modepage=*/0x0a,
 			      /*subpage=*/0x05, buffer, SD_BUF_SIZE, SD_TIMEOUT,
 			      sdkp->max_retries, &data, &sshdr);
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index e5e9da61f15d..1b65e6ae4137 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -8787,6 +8787,7 @@  static int ufshcd_probe_hba(struct ufs_hba *hba, bool init_dev_params)
 	    (hba->quirks & UFSHCD_QUIRK_REINIT_AFTER_MAX_GEAR_SWITCH)) {
 		/* Reset the device and controller before doing reinit */
 		ufshcd_device_reset(hba);
+		ufs_put_device_desc(hba);
 		ufshcd_hba_stop(hba);
 		ufshcd_vops_reinit_notify(hba);
 		ret = ufshcd_hba_enable(hba);
diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c
index b31464740f6c..8c8b5e6041cc 100644
--- a/drivers/usb/storage/scsiglue.c
+++ b/drivers/usb/storage/scsiglue.c
@@ -79,6 +79,12 @@  static int slave_alloc (struct scsi_device *sdev)
 	if (us->protocol == USB_PR_BULK && us->max_lun > 0)
 		sdev->sdev_bflags |= BLIST_FORCELUN;
 
+	/*
+	 * Some USB storage devices reset if the IO advice hints grouping mode
+	 * page is queried. Hence skip that mode page.
+	 */
+	sdev->sdev_bflags |= BLIST_SKIP_IO_HINTS;
+
 	return 0;
 }
 
diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index a48870a87a29..b610a2de4ae5 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -21,6 +21,7 @@ 
 #include <scsi/scsi.h>
 #include <scsi/scsi_eh.h>
 #include <scsi/scsi_dbg.h>
+#include <scsi/scsi_devinfo.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_host.h>
@@ -820,6 +821,12 @@  static int uas_slave_alloc(struct scsi_device *sdev)
 	struct uas_dev_info *devinfo =
 		(struct uas_dev_info *)sdev->host->hostdata;
 
+	/*
+	 * Some USB storage devices reset if the IO advice hints grouping mode
+	 * page is queried. Hence skip that mode page.
+	 */
+	sdev->sdev_bflags |= BLIST_SKIP_IO_HINTS;
+
 	sdev->hostdata = devinfo;
 	return 0;
 }
diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
index 6b548dc2c496..1d79a3b536ce 100644
--- a/include/scsi/scsi_devinfo.h
+++ b/include/scsi/scsi_devinfo.h
@@ -69,8 +69,10 @@ 
 #define BLIST_RETRY_ITF		((__force blist_flags_t)(1ULL << 32))
 /* Always retry ABORTED_COMMAND with ASC 0xc1 */
 #define BLIST_RETRY_ASC_C1	((__force blist_flags_t)(1ULL << 33))
+/* Do not query the IO Advice Hints Grouping mode page */
+#define BLIST_SKIP_IO_HINTS	((__force blist_flags_t)(1ULL << 34))
 
-#define __BLIST_LAST_USED BLIST_RETRY_ASC_C1
+#define __BLIST_LAST_USED BLIST_SKIP_IO_HINTS
 
 #define __BLIST_HIGH_UNUSED (~(__BLIST_LAST_USED | \
 			       (__force blist_flags_t) \