Message ID | 20230721100015.27124-1-liulongfang@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | USB:bugfix a controller halt error | expand |
On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: > On systems that use ECC memory. The ECC error of the memory will > cause the USB controller to halt. It causes the usb_control_msg() > operation to fail. Why does ECC memory matter here? > At this point, the returned buffer data is an abnormal value, and > continuing to use it will lead to incorrect results. > > Therefore, it is necessary to judge the return value and exit. > > Signed-off-by: liulongfang <liulongfang@huawei.com> > --- > drivers/usb/core/hub.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c > index a739403a9e45..6a43198be263 100644 > --- a/drivers/usb/core/hub.c > +++ b/drivers/usb/core/hub.c > @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1, > USB_DT_DEVICE << 8, 0, > buf, GET_DESCRIPTOR_BUFSIZE, > initial_descriptor_timeout); > + /* On systems that use ECC memory, ECC errors can > + * cause the USB controller to halt. > + * It causes this operation to fail. At this time, > + * the buf data is an abnormal value and needs to be exited. > + */ > + if (r < 0) { > + kfree(buf); > + goto fail; > + } Are you sure this is correct? How was this tested? Seems to me that this will still return "success" if this code path ever happens, what am I missing? thanks, greg k-h
On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: > On systems that use ECC memory. The ECC error of the memory will > cause the USB controller to halt. It causes the usb_control_msg() > operation to fail. How often does this happen in real life? (Besides, it seems to me that if your system is getting a bunch of ECC memory errors then you've got much worse problems than a simple USB failure!) And why do you worry about ECC memory failures in particular? Can't _any_ kind of failure cause the usb_control_msg() operation to fail? > At this point, the returned buffer data is an abnormal value, and > continuing to use it will lead to incorrect results. The driver already contains code to check for abnormal values. The check is not perfect, but it should prevent things from going too badly wrong. > Therefore, it is necessary to judge the return value and exit. > > Signed-off-by: liulongfang <liulongfang@huawei.com> There is a flaw in your reasoning. The operation carried out here is deliberately unsafe (for full-speed devices). It is made before we know the actual maxpacket size for ep0, and as a result it might return an error code even when it works okay. This shouldn't happen, but a lot of USB hardware is unreliable. Therefore we must not ignore the result merely because r < 0. If we do that, the kernel might stop working with some devices. Alan Stern > --- > drivers/usb/core/hub.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c > index a739403a9e45..6a43198be263 100644 > --- a/drivers/usb/core/hub.c > +++ b/drivers/usb/core/hub.c > @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1, > USB_DT_DEVICE << 8, 0, > buf, GET_DESCRIPTOR_BUFSIZE, > initial_descriptor_timeout); > + /* On systems that use ECC memory, ECC errors can > + * cause the USB controller to halt. > + * It causes this operation to fail. At this time, > + * the buf data is an abnormal value and needs to be exited. > + */ > + if (r < 0) { > + kfree(buf); > + goto fail; > + } > + > switch (buf->bMaxPacketSize0) { > case 8: case 16: case 32: case 64: case 255: > if (buf->bDescriptorType == > -- > 2.24.0 >
On 21.07.23 16:57, Alan Stern wrote: > There is a flaw in your reasoning. > > The operation carried out here is deliberately unsafe (for full-speed > devices). It is made before we know the actual maxpacket size for ep0, > and as a result it might return an error code even when it works okay. > This shouldn't happen, but a lot of USB hardware is unreliable. > > Therefore we must not ignore the result merely because r < 0. If we do > that, the kernel might stop working with some devices. Right. However, we must make sure we are operating on controlled results. As is we are operating on a random buffer without checking an IO operation has been performed on it. Regards Oliver
On 2023/7/21 19:08, Greg KH Wrote: > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: >> On systems that use ECC memory. The ECC error of the memory will >> cause the USB controller to halt. It causes the usb_control_msg() >> operation to fail. > > Why does ECC memory matter here? > This is a test conducted under a special test scenario. ECC memory errors are caused by some test tools. >> At this point, the returned buffer data is an abnormal value, and >> continuing to use it will lead to incorrect results. >> >> Therefore, it is necessary to judge the return value and exit. >> >> Signed-off-by: liulongfang <liulongfang@huawei.com> >> --- >> drivers/usb/core/hub.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c >> index a739403a9e45..6a43198be263 100644 >> --- a/drivers/usb/core/hub.c >> +++ b/drivers/usb/core/hub.c >> @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1, >> USB_DT_DEVICE << 8, 0, >> buf, GET_DESCRIPTOR_BUFSIZE, >> initial_descriptor_timeout); >> + /* On systems that use ECC memory, ECC errors can >> + * cause the USB controller to halt. >> + * It causes this operation to fail. At this time, >> + * the buf data is an abnormal value and needs to be exited. >> + */ >> + if (r < 0) { >> + kfree(buf); >> + goto fail; >> + } > > Are you sure this is correct? How was this tested? Seems to me that > this will still return "success" if this code path ever happens, what am You are right. I made a patch error here. The code modification should be like this: if (r < 0) { retval = r; kfree(buf); goto fail; } thanks, Longfang. > I missing? > > thanks, > > greg k-h > . >
On 2023/7/21 22:57, Alan Stern Wrote: > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: >> On systems that use ECC memory. The ECC error of the memory will >> cause the USB controller to halt. It causes the usb_control_msg() >> operation to fail. > > How often does this happen in real life? (Besides, it seems to me that > if your system is getting a bunch of ECC memory errors then you've got > much worse problems than a simple USB failure!) > This problem is on ECC memory platform. In the test scenario, the problem is 100% reproducible. > And why do you worry about ECC memory failures in particular? Can't > _any_ kind of failure cause the usb_control_msg() operation to fail? > >> At this point, the returned buffer data is an abnormal value, and >> continuing to use it will lead to incorrect results. > > The driver already contains code to check for abnormal values. The > check is not perfect, but it should prevent things from going too > badly wrong. > If it is ECC memory error. These parameter checks would also actually be invalid. >> Therefore, it is necessary to judge the return value and exit. >> >> Signed-off-by: liulongfang <liulongfang@huawei.com> > > There is a flaw in your reasoning. > > The operation carried out here is deliberately unsafe (for full-speed > devices). It is made before we know the actual maxpacket size for ep0, > and as a result it might return an error code even when it works okay. > This shouldn't happen, but a lot of USB hardware is unreliable. > > Therefore we must not ignore the result merely because r < 0. If we do > that, the kernel might stop working with some devices. > It may be that the handling solution for ECC errors is different from that of the OS platform. On the test platform, after usb_control_msg() fails, reading the memory data of buf will directly lead to kernel crash: [ T14] Call trace: [ T14] hub_port_init+0x280/0x9f0 [ T14] hub_port_connect+0x1d4/0xa40 [ T14] hub_port_connect_change+0xb8/0x2b0 [ T14] port_event+0x430/0x5d0 [ T14] hub_event+0x138/0x4a0 [ T14] process_one_work+0x1c8/0x39c [ T14] worker_thread+0x150/0x3d0 [ T14] kthread+0xfc/0x130 [ T14] ret_from_fork+0x10/0x18 [ T14] Code: 528000c2 b9007fea 94002c9a b9407fea (39401f41) thanks, Longfang. > Alan Stern > >> --- >> drivers/usb/core/hub.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c >> index a739403a9e45..6a43198be263 100644 >> --- a/drivers/usb/core/hub.c >> +++ b/drivers/usb/core/hub.c >> @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1, >> USB_DT_DEVICE << 8, 0, >> buf, GET_DESCRIPTOR_BUFSIZE, >> initial_descriptor_timeout); >> + /* On systems that use ECC memory, ECC errors can >> + * cause the USB controller to halt. >> + * It causes this operation to fail. At this time, >> + * the buf data is an abnormal value and needs to be exited. >> + */ >> + if (r < 0) { >> + kfree(buf); >> + goto fail; >> + } >> + >> switch (buf->bMaxPacketSize0) { >> case 8: case 16: case 32: case 64: case 255: >> if (buf->bDescriptorType == >> -- >> 2.24.0 >> > > . >
On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote: > On 2023/7/21 19:08, Greg KH Wrote: > > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: > >> On systems that use ECC memory. The ECC error of the memory will > >> cause the USB controller to halt. It causes the usb_control_msg() > >> operation to fail. > > > > Why does ECC memory matter here? > > > > This is a test conducted under a special test scenario. > ECC memory errors are caused by some test tools. What memory is failing, and why does just this single check matter in the whole kernel? If hardware is broken, and failing, it's not the job of the kernel to protect against that, is it? Shouldn't the ECC memory controller have properly notified the kernel of the fault and reset the machine because it is now in an undetermined state? > > Are you sure this is correct? How was this tested? Seems to me that > > this will still return "success" if this code path ever happens, what am > > You are right. I made a patch error here. The code modification should be like this: > if (r < 0) { > retval = r; > kfree(buf); > goto fail; > } This means that you didn't test this change at all, so I don't really think it is needed :( thanks, greg k-h
On 26.07.23 09:18, Greg KH wrote: > On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote: >> This is a test conducted under a special test scenario. >> ECC memory errors are caused by some test tools. So we are looking at a corner case here. I think we need to step back, get an overview. And I would like to apologize for not being entirely helpful. I see a theoretical possibility here of what is going wrong and an extremely theoretical bug, but it would be very good if you could describe your test setup. So. You are inducing simulated memory errors. For this scenario to make any sense your failure must be 1. temporary - that is you have detected memory corruption but the RAM cell is not broken 2. unrecoverable - that is we have lost data 3. locateable - that is you know it hit the buffer of this operation and only it Am I correct so far? Furthermore your system reports the error to the HC, so that in last consequence the transfer fails. Right? > What memory is failing, and why does just this single check matter in > the whole kernel? The difference here is that we are deliberately ignoring errors. > If hardware is broken, and failing, it's not the job of the kernel to > protect against that, is it? Shouldn't the ECC memory controller have > properly notified the kernel of the fault Definitely it should. But this whole discussion makes only sense if exactly that happens. > and reset the machine because > it is now in an undetermined state? No. It is not in an undetermined state if your detection logic is good enough. Regards Oliver
On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote: > On 2023/7/21 22:57, Alan Stern Wrote: > > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: > >> On systems that use ECC memory. The ECC error of the memory will > >> cause the USB controller to halt. It causes the usb_control_msg() > >> operation to fail. > > > > How often does this happen in real life? (Besides, it seems to me that > > if your system is getting a bunch of ECC memory errors then you've got > > much worse problems than a simple USB failure!) > > > > This problem is on ECC memory platform. > In the test scenario, the problem is 100% reproducible. > > > And why do you worry about ECC memory failures in particular? Can't > > _any_ kind of failure cause the usb_control_msg() operation to fail? > > > >> At this point, the returned buffer data is an abnormal value, and > >> continuing to use it will lead to incorrect results. > > > > The driver already contains code to check for abnormal values. The > > check is not perfect, but it should prevent things from going too > > badly wrong. > > > > If it is ECC memory error. These parameter checks would also > actually be invalid. > > >> Therefore, it is necessary to judge the return value and exit. > >> > >> Signed-off-by: liulongfang <liulongfang@huawei.com> > > > > There is a flaw in your reasoning. > > > > The operation carried out here is deliberately unsafe (for full-speed > > devices). It is made before we know the actual maxpacket size for ep0, > > and as a result it might return an error code even when it works okay. > > This shouldn't happen, but a lot of USB hardware is unreliable. > > > > Therefore we must not ignore the result merely because r < 0. If we do > > that, the kernel might stop working with some devices. > > > It may be that the handling solution for ECC errors is different from that > of the OS platform. On the test platform, after usb_control_msg() fails, > reading the memory data of buf will directly lead to kernel crash: All right, then here's a proposal for a different way to solve the problem: Change the kernel's handler for the ECC error notification. Have it clear the affected parts of memory, so that the kernel can go ahead and use them without crashing. It seems to me that something along these lines must be necessary in any case. Unless the bad memory is cleared somehow, it would never be usable again. The kernel might deallocate it, then reallocate for another purpose, and then crash when the new user tries to access it. In fact, this scenario could still happen even with your patch, which means the patch doesn't really fix the problem. Alan Stern
On 2023/7/26 15:18, Greg KH wrote: > On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote: >> On 2023/7/21 19:08, Greg KH Wrote: >>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: >>>> On systems that use ECC memory. The ECC error of the memory will >>>> cause the USB controller to halt. It causes the usb_control_msg() >>>> operation to fail. >>> >>> Why does ECC memory matter here? >>> >> >> This is a test conducted under a special test scenario. >> ECC memory errors are caused by some test tools. > > What memory is failing, and why does just this single check matter in > the whole kernel? > > If hardware is broken, and failing, it's not the job of the kernel to > protect against that, is it? Shouldn't the ECC memory controller have > properly notified the kernel of the fault and reset the machine because > it is now in an undetermined state? > >>> Are you sure this is correct? How was this tested? Seems to me that >>> this will still return "success" if this code path ever happens, what am >> >> You are right. I made a patch error here. The code modification should be like this: >> if (r < 0) { >> retval = r; >> kfree(buf); >> goto fail; >> } > > This means that you didn't test this change at all, so I don't really > think it is needed :( > In fact, currently we have not added this retval assignment operation. Due to the circumvention of buf access during patch testing. This problem causes calltrace not to trigger. Thanks. Longfang. > thanks, > > greg k-h > . >
On 2023/7/26 19:16, Oliver Neukum wrote: > On 26.07.23 09:18, Greg KH wrote: >> On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote: > >>> This is a test conducted under a special test scenario. >>> ECC memory errors are caused by some test tools. > > So we are looking at a corner case here. > > I think we need to step back, get an overview. And I would > like to apologize for not being entirely helpful. > > I see a theoretical possibility here of what is going wrong > and an extremely theoretical bug, but it would be very good > if you could describe your test setup. > > So. You are inducing simulated memory errors. > For this scenario to make any sense your failure must be > > 1. temporary - that is you have detected memory corruption but the RAM cell is not broken > 2. unrecoverable - that is we have lost data > 3. locateable - that is you know it hit the buffer of this operation and only it > > Am I correct so far? > You are right about the testing process. But this problem can exist in the real environment, just the probability of occurrence is very low. Use a test tool just to make it easier to trigger it. > Furthermore your system reports the error to the HC, so that in last > consequence the transfer fails. Right? > >> What memory is failing, and why does just this single check matter in >> the whole kernel? > > The difference here is that we are deliberately ignoring errors. > >> If hardware is broken, and failing, it's not the job of the kernel to >> protect against that, is it? Shouldn't the ECC memory controller have >> properly notified the kernel of the fault > > Definitely it should. But this whole discussion makes only sense > if exactly that happens. > Our test tool only simulates that external interference destroys this part of the data in the buffer on the ECC memory. Even without this testing tool. This problem may also occur on real business hardware devices. >> and reset the machine because >> it is now in an undetermined state? > > No. It is not in an undetermined state if your detection logic is good enough. > > Regards > Oliver > . > Thanks, Longfang.
On 2023/7/26 22:20, Alan Stern wrote: > On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote: >> On 2023/7/21 22:57, Alan Stern Wrote: >>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote: >>>> On systems that use ECC memory. The ECC error of the memory will >>>> cause the USB controller to halt. It causes the usb_control_msg() >>>> operation to fail. >>> >>> How often does this happen in real life? (Besides, it seems to me that >>> if your system is getting a bunch of ECC memory errors then you've got >>> much worse problems than a simple USB failure!) >>> >> >> This problem is on ECC memory platform. >> In the test scenario, the problem is 100% reproducible. >> >>> And why do you worry about ECC memory failures in particular? Can't >>> _any_ kind of failure cause the usb_control_msg() operation to fail? >>> >>>> At this point, the returned buffer data is an abnormal value, and >>>> continuing to use it will lead to incorrect results. >>> >>> The driver already contains code to check for abnormal values. The >>> check is not perfect, but it should prevent things from going too >>> badly wrong. >>> >> >> If it is ECC memory error. These parameter checks would also >> actually be invalid. >> >>>> Therefore, it is necessary to judge the return value and exit. >>>> >>>> Signed-off-by: liulongfang <liulongfang@huawei.com> >>> >>> There is a flaw in your reasoning. >>> >>> The operation carried out here is deliberately unsafe (for full-speed >>> devices). It is made before we know the actual maxpacket size for ep0, >>> and as a result it might return an error code even when it works okay. >>> This shouldn't happen, but a lot of USB hardware is unreliable. >>> >>> Therefore we must not ignore the result merely because r < 0. If we do >>> that, the kernel might stop working with some devices. >>> >> It may be that the handling solution for ECC errors is different from that >> of the OS platform. On the test platform, after usb_control_msg() fails, >> reading the memory data of buf will directly lead to kernel crash: > > All right, then here's a proposal for a different way to solve the > problem: Change the kernel's handler for the ECC error notification. > Have it clear the affected parts of memory, so that the kernel can go > ahead and use them without crashing. > > It seems to me that something along these lines must be necessary in > any case. Unless the bad memory is cleared somehow, it would never be > usable again. The kernel might deallocate it, then reallocate for > another purpose, and then crash when the new user tries to access it. > > In fact, this scenario could still happen even with your patch, which > means the patch doesn't really fix the problem. > This patch is only used to prevent data in the buffer from being accessed. As long as the data is not accessed, the kernel does not crash. thanks, Longfang. > Alan Stern > > . >
On 27.07.23 09:00, liulongfang wrote: > On 2023/7/26 19:16, Oliver Neukum wrote: >> 1. temporary - that is you have detected memory corruption but the RAM cell is not broken >> 2. unrecoverable - that is we have lost data >> 3. locateable - that is you know it hit the buffer of this operation and only it >> >> Am I correct so far? >> > You are right about the testing process. > But this problem can exist in the real environment, just the probability of > occurrence is very low. Understood. Bit flips are random. But this leaves two open questions. 1. How is the error reported 2. How are we supposed to handle it Firstly, if we already know that there is an ECC failure on the host we can use a specific error code and can check for that. Secondly, does this mean that the affected memory location must not be touched until the machine is power cycled or does it simply mean that the buffer is invalid? > Our test tool only simulates that external interference destroys this part > of the data in the buffer on the ECC memory. Even without this testing tool. > This problem may also occur on real business hardware devices. Understood. But what is the correct remedy if teh problem strikes for real? Regards Oliver
On 27.07.23 09:03, liulongfang wrote: > This patch is only used to prevent data in the buffer from being accessed. > As long as the data is not accessed, the kernel does not crash. For how long? That information is cruical. Regards Oliver
On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote: > On 2023/7/26 22:20, Alan Stern wrote: > >> It may be that the handling solution for ECC errors is different from that > >> of the OS platform. On the test platform, after usb_control_msg() fails, > >> reading the memory data of buf will directly lead to kernel crash: > > > > All right, then here's a proposal for a different way to solve the > > problem: Change the kernel's handler for the ECC error notification. > > Have it clear the affected parts of memory, so that the kernel can go > > ahead and use them without crashing. > > > > It seems to me that something along these lines must be necessary in > > any case. Unless the bad memory is cleared somehow, it would never be > > usable again. The kernel might deallocate it, then reallocate for > > another purpose, and then crash when the new user tries to access it. > > > > In fact, this scenario could still happen even with your patch, which > > means the patch doesn't really fix the problem. > > > > This patch is only used to prevent data in the buffer from being accessed. > As long as the data is not accessed, the kernel does not crash. I still don't understand. You haven't provided nearly enough information. You should start by answering the questions that Oliver asked. Then answer this question: The code you are concerned about is this: r = usb_control_msg(udev, usb_rcvaddr0pipe(), USB_REQ_GET_DESCRIPTOR, USB_DIR_IN, USB_DT_DEVICE << 8, 0, buf, GET_DESCRIPTOR_BUFSIZE, initial_descriptor_timeout); switch (buf->bMaxPacketSize0) { You're worried that if an ECC memory error occurs during the usb_control_msg transfer, the kernel will crash when the "switch" statement tries to read the value of buf->bMaxPacketSize0. That's a reasonable thing to worry about. Now think about what will happen if usb_control_msg works successfully but an ECC memory error occurs when the return code from the function call is stored in r? Won't the kernel crash then? Or if not then, when it reads the value of r a few lines later? So why bother to handle the first kind of ECC error but not the second? What makes one ECC error more important than another? Alan Stern
On 27.07.23 16:42, Alan Stern wrote: > On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote: >> On 2023/7/26 22:20, Alan Stern wrote: >>> It seems to me that something along these lines must be necessary in >>> any case. Unless the bad memory is cleared somehow, it would never be >>> usable again. The kernel might deallocate it, then reallocate for >>> another purpose, and then crash when the new user tries to access it. >>> >>> In fact, this scenario could still happen even with your patch, which >>> means the patch doesn't really fix the problem. I suppose in theory you could have something like a bad blocks list just for RAM, but that would really hurt. You'd have to do something about every DMA operation in every driver in theory. Error handling would basically be an intentional memory leak. >> This patch is only used to prevent data in the buffer from being accessed. >> As long as the data is not accessed, the kernel does not crash. > > I still don't understand. You haven't provided nearly enough > information. You should start by answering the questions that Oliver > asked. Then answer this question: > > The code you are concerned about is this: > > r = usb_control_msg(udev, usb_rcvaddr0pipe(), > USB_REQ_GET_DESCRIPTOR, USB_DIR_IN, > USB_DT_DEVICE << 8, 0, > buf, GET_DESCRIPTOR_BUFSIZE, > initial_descriptor_timeout); > switch (buf->bMaxPacketSize0) { > > You're worried that if an ECC memory error occurs during the > usb_control_msg transfer, the kernel will crash when the "switch" > statement tries to read the value of buf->bMaxPacketSize0. That's a > reasonable thing to worry about. Albeit unlikely. If the hardware and implementation are reasonable you'd return a specific error code from the HCD and clean up the RAM in your ecc driver. The fix for USB would then conceptually be something like retryio: r = usb_control_msg() if (r == -EMEMORYCORRUPTION) goto retryio; Regards Oliver
On Thu, Jul 27, 2023 at 05:31:41PM +0200, Oliver Neukum wrote: > On 27.07.23 16:42, Alan Stern wrote: > > On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote: > > > On 2023/7/26 22:20, Alan Stern wrote: > > > > > It seems to me that something along these lines must be necessary in > > > > any case. Unless the bad memory is cleared somehow, it would never be > > > > usable again. The kernel might deallocate it, then reallocate for > > > > another purpose, and then crash when the new user tries to access it. > > > > > > > > In fact, this scenario could still happen even with your patch, which > > > > means the patch doesn't really fix the problem. > > I suppose in theory you could have something like a bad blocks list > just for RAM, but that would really hurt. You'd have to do something > about every DMA operation in every driver in theory. > > Error handling would basically be an intentional memory leak. I started out thinking this way, but maybe that's not how it works. Perhaps simply overwriting the part of memory that got the ECC error would clear the error state. (This may depend on the kind of error, one-time vs. permanent.) If that's the case, and if the memory buffer was deallocated without being accessed and then later reallocated, things would be okay. The routine that reallocated the buffer wouldn't try to read from it before initializing it somehow. > > > This patch is only used to prevent data in the buffer from being accessed. > > > As long as the data is not accessed, the kernel does not crash. > > > > I still don't understand. You haven't provided nearly enough > > information. You should start by answering the questions that Oliver > > asked. Then answer this question: > > > > The code you are concerned about is this: > > > > r = usb_control_msg(udev, usb_rcvaddr0pipe(), > > USB_REQ_GET_DESCRIPTOR, USB_DIR_IN, > > USB_DT_DEVICE << 8, 0, > > buf, GET_DESCRIPTOR_BUFSIZE, > > initial_descriptor_timeout); > > switch (buf->bMaxPacketSize0) { > > > > You're worried that if an ECC memory error occurs during the > > usb_control_msg transfer, the kernel will crash when the "switch" > > statement tries to read the value of buf->bMaxPacketSize0. That's a > > reasonable thing to worry about. > > Albeit unlikely. If the hardware and implementation are reasonable > you'd return a specific error code from the HCD and clean up the > RAM in your ecc driver. > > The fix for USB would then conceptually be something like > > retryio: > r = usb_control_msg() > if (r == -EMEMORYCORRUPTION) > goto retryio; Yes, we could do this, but it's not necessary. Let's say that the HCD returns -EMEMORYCORRUPTION and the ecc driver cleans up the RAM (probably by resetting its contents to 0, but possibly leaving garbage there instead). Then when the following code in hub_port_init() tests buf->bMaxPacketSize0, it will see an invalid value and will retry the transfer. Or, with low probability, it will see a valid but incorrect value. If that happens then later transfers using ep0 will fail, causing the hub driver to reiterate the outer loop in hub_port_connect(). Eventually the device will be correctly initialized and enumerated. Alan Stern
On 2023/7/27 23:57, Alan Stern wrote: > On Thu, Jul 27, 2023 at 05:31:41PM +0200, Oliver Neukum wrote: >> On 27.07.23 16:42, Alan Stern wrote: >>> On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote: >>>> On 2023/7/26 22:20, Alan Stern wrote: >> >>>>> It seems to me that something along these lines must be necessary in >>>>> any case. Unless the bad memory is cleared somehow, it would never be >>>>> usable again. The kernel might deallocate it, then reallocate for >>>>> another purpose, and then crash when the new user tries to access it. >>>>> >>>>> In fact, this scenario could still happen even with your patch, which >>>>> means the patch doesn't really fix the problem. >> >> I suppose in theory you could have something like a bad blocks list >> just for RAM, but that would really hurt. You'd have to do something >> about every DMA operation in every driver in theory. >> >> Error handling would basically be an intentional memory leak. > > I started out thinking this way, but maybe that's not how it works. > Perhaps simply overwriting the part of memory that got the ECC error > would clear the error state. (This may depend on the kind of error, > one-time vs. permanent.) > > If that's the case, and if the memory buffer was deallocated without > being accessed and then later reallocated, things would be okay. The > routine that reallocated the buffer wouldn't try to read from it before > initializing it somehow. > >>>> This patch is only used to prevent data in the buffer from being accessed. >>>> As long as the data is not accessed, the kernel does not crash. >>> >>> I still don't understand. You haven't provided nearly enough >>> information. You should start by answering the questions that Oliver >>> asked. Then answer this question: >>> >>> The code you are concerned about is this: >>> >>> r = usb_control_msg(udev, usb_rcvaddr0pipe(), >>> USB_REQ_GET_DESCRIPTOR, USB_DIR_IN, >>> USB_DT_DEVICE << 8, 0, >>> buf, GET_DESCRIPTOR_BUFSIZE, >>> initial_descriptor_timeout); >>> switch (buf->bMaxPacketSize0) { >>> >>> You're worried that if an ECC memory error occurs during the >>> usb_control_msg transfer, the kernel will crash when the "switch" >>> statement tries to read the value of buf->bMaxPacketSize0. That's a >>> reasonable thing to worry about. >> >> Albeit unlikely. If the hardware and implementation are reasonable >> you'd return a specific error code from the HCD and clean up the >> RAM in your ecc driver. >> >> The fix for USB would then conceptually be something like >> >> retryio: >> r = usb_control_msg() >> if (r == -EMEMORYCORRUPTION) >> goto retryio; > > Yes, we could do this, but it's not necessary. Let's say that the HCD > returns -EMEMORYCORRUPTION and the ecc driver cleans up the RAM > (probably by resetting its contents to 0, but possibly leaving garbage > there instead). Then when the following code in hub_port_init() tests > buf->bMaxPacketSize0, it will see an invalid value and will retry the > transfer. > > Or, with low probability, it will see a valid but incorrect value. If > that happens then later transfers using ep0 will fail, causing the hub > driver to reiterate the outer loop in hub_port_connect(). Eventually > the device will be correctly initialized and enumerated. > > Alan Stern > OK, thanks. Longfang. > . >
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c index a739403a9e45..6a43198be263 100644 --- a/drivers/usb/core/hub.c +++ b/drivers/usb/core/hub.c @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1, USB_DT_DEVICE << 8, 0, buf, GET_DESCRIPTOR_BUFSIZE, initial_descriptor_timeout); + /* On systems that use ECC memory, ECC errors can + * cause the USB controller to halt. + * It causes this operation to fail. At this time, + * the buf data is an abnormal value and needs to be exited. + */ + if (r < 0) { + kfree(buf); + goto fail; + } + switch (buf->bMaxPacketSize0) { case 8: case 16: case 32: case 64: case 255: if (buf->bDescriptorType ==
On systems that use ECC memory. The ECC error of the memory will cause the USB controller to halt. It causes the usb_control_msg() operation to fail. At this point, the returned buffer data is an abnormal value, and continuing to use it will lead to incorrect results. Therefore, it is necessary to judge the return value and exit. Signed-off-by: liulongfang <liulongfang@huawei.com> --- drivers/usb/core/hub.c | 10 ++++++++++ 1 file changed, 10 insertions(+)