USB:bugfix a controller halt error

Message ID	20230721100015.27124-1-liulongfang@huawei.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-usb-owner@vger.kernel.org> From: liulongfang <liulongfang@huawei.com> To: <gregkh@linuxfoundation.org> CC: <linux-usb@vger.kernel.org>, <linux-kernel@vger.kernel.org> Subject: [PATCH] USB:bugfix a controller halt error Date: Fri, 21 Jul 2023 18:00:15 +0800 Message-ID: <20230721100015.27124-1-liulongfang@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	USB:bugfix a controller halt error \| expand USB:bugfix a controller halt error

liulongfang July 21, 2023, 10 a.m. UTC

On systems that use ECC memory. The ECC error of the memory will
cause the USB controller to halt. It causes the usb_control_msg()
operation to fail.
At this point, the returned buffer data is an abnormal value, and
continuing to use it will lead to incorrect results.

Therefore, it is necessary to judge the return value and exit.

Signed-off-by: liulongfang <liulongfang@huawei.com>
---
 drivers/usb/core/hub.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Greg Kroah-Hartman July 21, 2023, 11:08 a.m. UTC | #1

On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
> On systems that use ECC memory. The ECC error of the memory will
> cause the USB controller to halt. It causes the usb_control_msg()
> operation to fail.

Why does ECC memory matter here?

> At this point, the returned buffer data is an abnormal value, and
> continuing to use it will lead to incorrect results.
> 
> Therefore, it is necessary to judge the return value and exit.
> 
> Signed-off-by: liulongfang <liulongfang@huawei.com>
> ---
>  drivers/usb/core/hub.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index a739403a9e45..6a43198be263 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1,
>  					USB_DT_DEVICE << 8, 0,
>  					buf, GET_DESCRIPTOR_BUFSIZE,
>  					initial_descriptor_timeout);
> +				/* On systems that use ECC memory, ECC errors can
> +				 * cause the USB controller to halt.
> +				 * It causes this operation to fail. At this time,
> +				 * the buf data is an abnormal value and needs to be exited.
> +				 */
> +				if (r < 0) {
> +					kfree(buf);
> +					goto fail;
> +				}

Are you sure this is correct?  How was this tested?  Seems to me that
this will still return "success" if this code path ever happens, what am
I missing?

thanks,

greg k-h

Alan Stern July 21, 2023, 2:57 p.m. UTC | #2

On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
> On systems that use ECC memory. The ECC error of the memory will
> cause the USB controller to halt. It causes the usb_control_msg()
> operation to fail.

How often does this happen in real life?  (Besides, it seems to me that 
if your system is getting a bunch of ECC memory errors then you've got 
much worse problems than a simple USB failure!)

And why do you worry about ECC memory failures in particular?  Can't 
_any_ kind of failure cause the usb_control_msg() operation to fail?

> At this point, the returned buffer data is an abnormal value, and
> continuing to use it will lead to incorrect results.

The driver already contains code to check for abnormal values.  The 
check is not perfect, but it should prevent things from going too 
badly wrong.

> Therefore, it is necessary to judge the return value and exit.
> 
> Signed-off-by: liulongfang <liulongfang@huawei.com>

There is a flaw in your reasoning.

The operation carried out here is deliberately unsafe (for full-speed 
devices).  It is made before we know the actual maxpacket size for ep0, 
and as a result it might return an error code even when it works okay.  
This shouldn't happen, but a lot of USB hardware is unreliable.

Therefore we must not ignore the result merely because r < 0.  If we do 
that, the kernel might stop working with some devices.

Alan Stern

> ---
>  drivers/usb/core/hub.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index a739403a9e45..6a43198be263 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1,
>  					USB_DT_DEVICE << 8, 0,
>  					buf, GET_DESCRIPTOR_BUFSIZE,
>  					initial_descriptor_timeout);
> +				/* On systems that use ECC memory, ECC errors can
> +				 * cause the USB controller to halt.
> +				 * It causes this operation to fail. At this time,
> +				 * the buf data is an abnormal value and needs to be exited.
> +				 */
> +				if (r < 0) {
> +					kfree(buf);
> +					goto fail;
> +				}
> +
>  				switch (buf->bMaxPacketSize0) {
>  				case 8: case 16: case 32: case 64: case 255:
>  					if (buf->bDescriptorType ==
> -- 
> 2.24.0
>

Oliver Neukum July 24, 2023, 12:35 p.m. UTC | #3

On 21.07.23 16:57, Alan Stern wrote:
  
> There is a flaw in your reasoning.
> 
> The operation carried out here is deliberately unsafe (for full-speed
> devices).  It is made before we know the actual maxpacket size for ep0,
> and as a result it might return an error code even when it works okay.
> This shouldn't happen, but a lot of USB hardware is unreliable.
> 
> Therefore we must not ignore the result merely because r < 0.  If we do
> that, the kernel might stop working with some devices.

Right. However, we must make sure we are operating on controlled results.
As is we are operating on a random buffer without checking an IO operation
has been performed on it.

	Regards
		Oliver

liulongfang July 26, 2023, 6:44 a.m. UTC | #4

On 2023/7/21 19:08, Greg KH Wrote:
> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
>> On systems that use ECC memory. The ECC error of the memory will
>> cause the USB controller to halt. It causes the usb_control_msg()
>> operation to fail.
> 
> Why does ECC memory matter here?
>

This is a test conducted under a special test scenario.
ECC memory errors are caused by some test tools.

>> At this point, the returned buffer data is an abnormal value, and
>> continuing to use it will lead to incorrect results.
>>
>> Therefore, it is necessary to judge the return value and exit.
>>
>> Signed-off-by: liulongfang <liulongfang@huawei.com>
>> ---
>>  drivers/usb/core/hub.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
>> index a739403a9e45..6a43198be263 100644
>> --- a/drivers/usb/core/hub.c
>> +++ b/drivers/usb/core/hub.c
>> @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1,
>>  					USB_DT_DEVICE << 8, 0,
>>  					buf, GET_DESCRIPTOR_BUFSIZE,
>>  					initial_descriptor_timeout);
>> +				/* On systems that use ECC memory, ECC errors can
>> +				 * cause the USB controller to halt.
>> +				 * It causes this operation to fail. At this time,
>> +				 * the buf data is an abnormal value and needs to be exited.
>> +				 */
>> +				if (r < 0) {
>> +					kfree(buf);
>> +					goto fail;
>> +				}
> 
> Are you sure this is correct?  How was this tested?  Seems to me that
> this will still return "success" if this code path ever happens, what am

You are right. I made a patch error here. The code modification should be like this:
if (r < 0) {
	retval = r;
	kfree(buf);
	goto fail;
}

thanks,

Longfang.
> I missing?
> 
> thanks,
> 
> greg k-h
> .
>

liulongfang July 26, 2023, 6:58 a.m. UTC | #5

On 2023/7/21 22:57, Alan Stern Wrote:
> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
>> On systems that use ECC memory. The ECC error of the memory will
>> cause the USB controller to halt. It causes the usb_control_msg()
>> operation to fail.
> 
> How often does this happen in real life?  (Besides, it seems to me that 
> if your system is getting a bunch of ECC memory errors then you've got 
> much worse problems than a simple USB failure!)
>

This problem is on ECC memory platform.
In the test scenario, the problem is 100% reproducible.

> And why do you worry about ECC memory failures in particular?  Can't 
> _any_ kind of failure cause the usb_control_msg() operation to fail?
> 
>> At this point, the returned buffer data is an abnormal value, and
>> continuing to use it will lead to incorrect results.
> 
> The driver already contains code to check for abnormal values.  The 
> check is not perfect, but it should prevent things from going too 
> badly wrong.
>

If it is ECC memory error. These parameter checks would also
actually be invalid.

>> Therefore, it is necessary to judge the return value and exit.
>>
>> Signed-off-by: liulongfang <liulongfang@huawei.com>
> 
> There is a flaw in your reasoning.
> 
> The operation carried out here is deliberately unsafe (for full-speed 
> devices).  It is made before we know the actual maxpacket size for ep0, 
> and as a result it might return an error code even when it works okay.  
> This shouldn't happen, but a lot of USB hardware is unreliable.
> 
> Therefore we must not ignore the result merely because r < 0.  If we do 
> that, the kernel might stop working with some devices.
> 
It may be that the handling solution for ECC errors is different from that
of the OS platform. On the test platform, after usb_control_msg() fails,
reading the memory data of buf will directly lead to kernel crash:

[ T14] Call trace:
[ T14] hub_port_init+0x280/0x9f0
[ T14] hub_port_connect+0x1d4/0xa40
[ T14] hub_port_connect_change+0xb8/0x2b0
[ T14] port_event+0x430/0x5d0
[ T14] hub_event+0x138/0x4a0
[ T14] process_one_work+0x1c8/0x39c
[ T14] worker_thread+0x150/0x3d0
[ T14] kthread+0xfc/0x130
[ T14] ret_from_fork+0x10/0x18
[ T14] Code: 528000c2 b9007fea 94002c9a b9407fea (39401f41)

thanks,
Longfang.
> Alan Stern
> 
>> ---
>>  drivers/usb/core/hub.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
>> index a739403a9e45..6a43198be263 100644
>> --- a/drivers/usb/core/hub.c
>> +++ b/drivers/usb/core/hub.c
>> @@ -4891,6 +4891,16 @@ hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1,
>>  					USB_DT_DEVICE << 8, 0,
>>  					buf, GET_DESCRIPTOR_BUFSIZE,
>>  					initial_descriptor_timeout);
>> +				/* On systems that use ECC memory, ECC errors can
>> +				 * cause the USB controller to halt.
>> +				 * It causes this operation to fail. At this time,
>> +				 * the buf data is an abnormal value and needs to be exited.
>> +				 */
>> +				if (r < 0) {
>> +					kfree(buf);
>> +					goto fail;
>> +				}
>> +
>>  				switch (buf->bMaxPacketSize0) {
>>  				case 8: case 16: case 32: case 64: case 255:
>>  					if (buf->bDescriptorType ==
>> -- 
>> 2.24.0
>>
> 
> .
>

Greg Kroah-Hartman July 26, 2023, 7:18 a.m. UTC | #6

On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote:
> On 2023/7/21 19:08, Greg KH Wrote:
> > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
> >> On systems that use ECC memory. The ECC error of the memory will
> >> cause the USB controller to halt. It causes the usb_control_msg()
> >> operation to fail.
> > 
> > Why does ECC memory matter here?
> >
> 
> This is a test conducted under a special test scenario.
> ECC memory errors are caused by some test tools.

What memory is failing, and why does just this single check matter in
the whole kernel?

If hardware is broken, and failing, it's not the job of the kernel to
protect against that, is it?  Shouldn't the ECC memory controller have
properly notified the kernel of the fault and reset the machine because
it is now in an undetermined state?

> > Are you sure this is correct?  How was this tested?  Seems to me that
> > this will still return "success" if this code path ever happens, what am
> 
> You are right. I made a patch error here. The code modification should be like this:
> if (r < 0) {
> 	retval = r;
> 	kfree(buf);
> 	goto fail;
> }

This means that you didn't test this change at all, so I don't really
think it is needed :(

thanks,

greg k-h

Oliver Neukum July 26, 2023, 11:16 a.m. UTC | #7

On 26.07.23 09:18, Greg KH wrote:
> On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote:

>> This is a test conducted under a special test scenario.
>> ECC memory errors are caused by some test tools.

So we are looking at a corner case here.

I think we need to step back, get an overview. And I would
like to apologize for not being entirely helpful.

I see a theoretical possibility here of what is going wrong
and an extremely theoretical bug, but it would be very good
if you could describe your test setup.

So. You are inducing simulated memory errors.
For this scenario to make any sense your failure must be

1. temporary - that is you have detected memory corruption but the RAM cell is not broken
2. unrecoverable - that is we have lost data
3. locateable - that is you know it hit the buffer of this operation and only it

Am I correct so far?

Furthermore your system reports the error to the HC, so that in last
consequence the transfer fails. Right?

> What memory is failing, and why does just this single check matter in
> the whole kernel?

The difference here is that we are deliberately ignoring errors.

> If hardware is broken, and failing, it's not the job of the kernel to
> protect against that, is it?  Shouldn't the ECC memory controller have
> properly notified the kernel of the fault

Definitely it should. But this whole discussion makes only sense
if exactly that happens.

> and reset the machine because
> it is now in an undetermined state?

No. It is not in an undetermined state if your detection logic is good enough.

	Regards
		Oliver

Alan Stern July 26, 2023, 2:20 p.m. UTC | #8

On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote:
> On 2023/7/21 22:57, Alan Stern Wrote:
> > On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
> >> On systems that use ECC memory. The ECC error of the memory will
> >> cause the USB controller to halt. It causes the usb_control_msg()
> >> operation to fail.
> > 
> > How often does this happen in real life?  (Besides, it seems to me that 
> > if your system is getting a bunch of ECC memory errors then you've got 
> > much worse problems than a simple USB failure!)
> >
> 
> This problem is on ECC memory platform.
> In the test scenario, the problem is 100% reproducible.
> 
> > And why do you worry about ECC memory failures in particular?  Can't 
> > _any_ kind of failure cause the usb_control_msg() operation to fail?
> > 
> >> At this point, the returned buffer data is an abnormal value, and
> >> continuing to use it will lead to incorrect results.
> > 
> > The driver already contains code to check for abnormal values.  The 
> > check is not perfect, but it should prevent things from going too 
> > badly wrong.
> >
> 
> If it is ECC memory error. These parameter checks would also
> actually be invalid.
> 
> >> Therefore, it is necessary to judge the return value and exit.
> >>
> >> Signed-off-by: liulongfang <liulongfang@huawei.com>
> > 
> > There is a flaw in your reasoning.
> > 
> > The operation carried out here is deliberately unsafe (for full-speed 
> > devices).  It is made before we know the actual maxpacket size for ep0, 
> > and as a result it might return an error code even when it works okay.  
> > This shouldn't happen, but a lot of USB hardware is unreliable.
> > 
> > Therefore we must not ignore the result merely because r < 0.  If we do 
> > that, the kernel might stop working with some devices.
> > 
> It may be that the handling solution for ECC errors is different from that
> of the OS platform. On the test platform, after usb_control_msg() fails,
> reading the memory data of buf will directly lead to kernel crash:

All right, then here's a proposal for a different way to solve the 
problem: Change the kernel's handler for the ECC error notification.  
Have it clear the affected parts of memory, so that the kernel can go 
ahead and use them without crashing.

It seems to me that something along these lines must be necessary in 
any case.  Unless the bad memory is cleared somehow, it would never be 
usable again.  The kernel might deallocate it, then reallocate for 
another purpose, and then crash when the new user tries to access it.  

In fact, this scenario could still happen even with your patch, which 
means the patch doesn't really fix the problem.

Alan Stern

liulongfang July 27, 2023, 4:02 a.m. UTC | #9

On 2023/7/26 15:18, Greg KH wrote:
> On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote:
>> On 2023/7/21 19:08, Greg KH Wrote:
>>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
>>>> On systems that use ECC memory. The ECC error of the memory will
>>>> cause the USB controller to halt. It causes the usb_control_msg()
>>>> operation to fail.
>>>
>>> Why does ECC memory matter here?
>>>
>>
>> This is a test conducted under a special test scenario.
>> ECC memory errors are caused by some test tools.
> 
> What memory is failing, and why does just this single check matter in
> the whole kernel?
> 
> If hardware is broken, and failing, it's not the job of the kernel to
> protect against that, is it?  Shouldn't the ECC memory controller have
> properly notified the kernel of the fault and reset the machine because
> it is now in an undetermined state?
> 
>>> Are you sure this is correct?  How was this tested?  Seems to me that
>>> this will still return "success" if this code path ever happens, what am
>>
>> You are right. I made a patch error here. The code modification should be like this:
>> if (r < 0) {
>> 	retval = r;
>> 	kfree(buf);
>> 	goto fail;
>> }
> 
> This means that you didn't test this change at all, so I don't really
> think it is needed :(
> 
In fact, currently we have not added this retval assignment operation.
Due to the circumvention of buf access during patch testing.
This problem causes calltrace not to trigger.

Thanks.
Longfang.
> thanks,
> 
> greg k-h
> .
>

liulongfang July 27, 2023, 7 a.m. UTC | #10

On 2023/7/26 19:16, Oliver Neukum wrote:
> On 26.07.23 09:18, Greg KH wrote:
>> On Wed, Jul 26, 2023 at 02:44:01PM +0800, liulongfang wrote:
> 
>>> This is a test conducted under a special test scenario.
>>> ECC memory errors are caused by some test tools.
> 
> So we are looking at a corner case here.
> 
> I think we need to step back, get an overview. And I would
> like to apologize for not being entirely helpful.
> 
> I see a theoretical possibility here of what is going wrong
> and an extremely theoretical bug, but it would be very good
> if you could describe your test setup.
> 
> So. You are inducing simulated memory errors.
> For this scenario to make any sense your failure must be
> 
> 1. temporary - that is you have detected memory corruption but the RAM cell is not broken
> 2. unrecoverable - that is we have lost data
> 3. locateable - that is you know it hit the buffer of this operation and only it
> 
> Am I correct so far?
>
You are right about the testing process.
But this problem can exist in the real environment, just the probability of
occurrence is very low.
Use a test tool just to make it easier to trigger it.

> Furthermore your system reports the error to the HC, so that in last
> consequence the transfer fails. Right?
> 
>> What memory is failing, and why does just this single check matter in
>> the whole kernel?
> 
> The difference here is that we are deliberately ignoring errors.
>  
>> If hardware is broken, and failing, it's not the job of the kernel to
>> protect against that, is it?  Shouldn't the ECC memory controller have
>> properly notified the kernel of the fault
> 
> Definitely it should. But this whole discussion makes only sense
> if exactly that happens.
>
Our test tool only simulates that external interference destroys this part
of the data in the buffer on the ECC memory. Even without this testing tool.
This problem may also occur on real business hardware devices.

>> and reset the machine because
>> it is now in an undetermined state?
> 
> No. It is not in an undetermined state if your detection logic is good enough.
> 
>     Regards
>         Oliver
> .
> 
Thanks,
Longfang.

liulongfang July 27, 2023, 7:03 a.m. UTC | #11

On 2023/7/26 22:20, Alan Stern wrote:
> On Wed, Jul 26, 2023 at 02:58:18PM +0800, liulongfang wrote:
>> On 2023/7/21 22:57, Alan Stern Wrote:
>>> On Fri, Jul 21, 2023 at 06:00:15PM +0800, liulongfang wrote:
>>>> On systems that use ECC memory. The ECC error of the memory will
>>>> cause the USB controller to halt. It causes the usb_control_msg()
>>>> operation to fail.
>>>
>>> How often does this happen in real life?  (Besides, it seems to me that 
>>> if your system is getting a bunch of ECC memory errors then you've got 
>>> much worse problems than a simple USB failure!)
>>>
>>
>> This problem is on ECC memory platform.
>> In the test scenario, the problem is 100% reproducible.
>>
>>> And why do you worry about ECC memory failures in particular?  Can't 
>>> _any_ kind of failure cause the usb_control_msg() operation to fail?
>>>
>>>> At this point, the returned buffer data is an abnormal value, and
>>>> continuing to use it will lead to incorrect results.
>>>
>>> The driver already contains code to check for abnormal values.  The 
>>> check is not perfect, but it should prevent things from going too 
>>> badly wrong.
>>>
>>
>> If it is ECC memory error. These parameter checks would also
>> actually be invalid.
>>
>>>> Therefore, it is necessary to judge the return value and exit.
>>>>
>>>> Signed-off-by: liulongfang <liulongfang@huawei.com>
>>>
>>> There is a flaw in your reasoning.
>>>
>>> The operation carried out here is deliberately unsafe (for full-speed 
>>> devices).  It is made before we know the actual maxpacket size for ep0, 
>>> and as a result it might return an error code even when it works okay.  
>>> This shouldn't happen, but a lot of USB hardware is unreliable.
>>>
>>> Therefore we must not ignore the result merely because r < 0.  If we do 
>>> that, the kernel might stop working with some devices.
>>>
>> It may be that the handling solution for ECC errors is different from that
>> of the OS platform. On the test platform, after usb_control_msg() fails,
>> reading the memory data of buf will directly lead to kernel crash:
> 
> All right, then here's a proposal for a different way to solve the 
> problem: Change the kernel's handler for the ECC error notification.  
> Have it clear the affected parts of memory, so that the kernel can go 
> ahead and use them without crashing.
> 
> It seems to me that something along these lines must be necessary in 
> any case.  Unless the bad memory is cleared somehow, it would never be 
> usable again.  The kernel might deallocate it, then reallocate for 
> another purpose, and then crash when the new user tries to access it.  
> 
> In fact, this scenario could still happen even with your patch, which 
> means the patch doesn't really fix the problem.
> 

This patch is only used to prevent data in the buffer from being accessed.
As long as the data is not accessed, the kernel does not crash.

thanks,
Longfang.

> Alan Stern
> 
> .
>

Oliver Neukum July 27, 2023, 8:46 a.m. UTC | #12

On 27.07.23 09:00, liulongfang wrote:
> On 2023/7/26 19:16, Oliver Neukum wrote:

>> 1. temporary - that is you have detected memory corruption but the RAM cell is not broken
>> 2. unrecoverable - that is we have lost data
>> 3. locateable - that is you know it hit the buffer of this operation and only it
>>
>> Am I correct so far?
>>
> You are right about the testing process.
> But this problem can exist in the real environment, just the probability of
> occurrence is very low.

Understood. Bit flips are random.

But this leaves two open questions.

1. How is the error reported

2. How are we supposed to handle it

Firstly, if we already know that there is an ECC failure
on the host we can use a specific error code and can check
for that.

Secondly, does this mean that the affected memory location
must not be touched until the machine is power cycled
or does it simply mean that the buffer is invalid?

> Our test tool only simulates that external interference destroys this part
> of the data in the buffer on the ECC memory. Even without this testing tool.
> This problem may also occur on real business hardware devices.

Understood. But what is the correct remedy if teh problem strikes
for real?

	Regards
		Oliver

Oliver Neukum July 27, 2023, 8:47 a.m. UTC | #13

On 27.07.23 09:03, liulongfang wrote:

> This patch is only used to prevent data in the buffer from being accessed.
> As long as the data is not accessed, the kernel does not crash.

For how long? That information is cruical.

	Regards
		Oliver

Alan Stern July 27, 2023, 2:42 p.m. UTC | #14

On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote:
> On 2023/7/26 22:20, Alan Stern wrote:
> >> It may be that the handling solution for ECC errors is different from that
> >> of the OS platform. On the test platform, after usb_control_msg() fails,
> >> reading the memory data of buf will directly lead to kernel crash:
> > 
> > All right, then here's a proposal for a different way to solve the 
> > problem: Change the kernel's handler for the ECC error notification.  
> > Have it clear the affected parts of memory, so that the kernel can go 
> > ahead and use them without crashing.
> > 
> > It seems to me that something along these lines must be necessary in 
> > any case.  Unless the bad memory is cleared somehow, it would never be 
> > usable again.  The kernel might deallocate it, then reallocate for 
> > another purpose, and then crash when the new user tries to access it.  
> > 
> > In fact, this scenario could still happen even with your patch, which 
> > means the patch doesn't really fix the problem.
> > 
> 
> This patch is only used to prevent data in the buffer from being accessed.
> As long as the data is not accessed, the kernel does not crash.

I still don't understand.  You haven't provided nearly enough 
information.  You should start by answering the questions that Oliver 
asked.  Then answer this question:

The code you are concerned about is this:

		r = usb_control_msg(udev, usb_rcvaddr0pipe(),
				USB_REQ_GET_DESCRIPTOR, USB_DIR_IN,
				USB_DT_DEVICE << 8, 0,
				buf, GET_DESCRIPTOR_BUFSIZE,
				initial_descriptor_timeout);
		switch (buf->bMaxPacketSize0) {

You're worried that if an ECC memory error occurs during the 
usb_control_msg transfer, the kernel will crash when the "switch" 
statement tries to read the value of buf->bMaxPacketSize0.  That's a 
reasonable thing to worry about.

Now think about what will happen if usb_control_msg works successfully 
but an ECC memory error occurs when the return code from the function 
call is stored in r?  Won't the kernel crash then?  Or if not then, when 
it reads the value of r a few lines later?

So why bother to handle the first kind of ECC error but not the second?  
What makes one ECC error more important than another?

Alan Stern

Oliver Neukum July 27, 2023, 3:31 p.m. UTC | #15

On 27.07.23 16:42, Alan Stern wrote:
> On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote:
>> On 2023/7/26 22:20, Alan Stern wrote:

>>> It seems to me that something along these lines must be necessary in
>>> any case.  Unless the bad memory is cleared somehow, it would never be
>>> usable again.  The kernel might deallocate it, then reallocate for
>>> another purpose, and then crash when the new user tries to access it.
>>>
>>> In fact, this scenario could still happen even with your patch, which
>>> means the patch doesn't really fix the problem.

I suppose in theory you could have something like a bad blocks list
just for RAM, but that would really hurt. You'd have to do something
about every DMA operation in every driver in theory.

Error handling would basically be an intentional memory leak.

>> This patch is only used to prevent data in the buffer from being accessed.
>> As long as the data is not accessed, the kernel does not crash.
> 
> I still don't understand.  You haven't provided nearly enough
> information.  You should start by answering the questions that Oliver
> asked.  Then answer this question:
> 
> The code you are concerned about is this:
> 
> 		r = usb_control_msg(udev, usb_rcvaddr0pipe(),
> 				USB_REQ_GET_DESCRIPTOR, USB_DIR_IN,
> 				USB_DT_DEVICE << 8, 0,
> 				buf, GET_DESCRIPTOR_BUFSIZE,
> 				initial_descriptor_timeout);
> 		switch (buf->bMaxPacketSize0) {
> 
> You're worried that if an ECC memory error occurs during the
> usb_control_msg transfer, the kernel will crash when the "switch"
> statement tries to read the value of buf->bMaxPacketSize0.  That's a
> reasonable thing to worry about.

Albeit unlikely. If the hardware and implementation are reasonable
you'd return a specific error code from the HCD and clean up the
RAM in your ecc driver.

The fix for USB would then conceptually be something like

retryio:
	r = usb_control_msg()
	if (r == -EMEMORYCORRUPTION)
		goto retryio;

	Regards
		Oliver

Alan Stern July 27, 2023, 3:57 p.m. UTC | #16

On Thu, Jul 27, 2023 at 05:31:41PM +0200, Oliver Neukum wrote:
> On 27.07.23 16:42, Alan Stern wrote:
> > On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote:
> > > On 2023/7/26 22:20, Alan Stern wrote:
> 
> > > > It seems to me that something along these lines must be necessary in
> > > > any case.  Unless the bad memory is cleared somehow, it would never be
> > > > usable again.  The kernel might deallocate it, then reallocate for
> > > > another purpose, and then crash when the new user tries to access it.
> > > > 
> > > > In fact, this scenario could still happen even with your patch, which
> > > > means the patch doesn't really fix the problem.
> 
> I suppose in theory you could have something like a bad blocks list
> just for RAM, but that would really hurt. You'd have to do something
> about every DMA operation in every driver in theory.
> 
> Error handling would basically be an intentional memory leak.

I started out thinking this way, but maybe that's not how it works.  
Perhaps simply overwriting the part of memory that got the ECC error 
would clear the error state.  (This may depend on the kind of error, 
one-time vs. permanent.)

If that's the case, and if the memory buffer was deallocated without 
being accessed and then later reallocated, things would be okay.  The 
routine that reallocated the buffer wouldn't try to read from it before 
initializing it somehow.

> > > This patch is only used to prevent data in the buffer from being accessed.
> > > As long as the data is not accessed, the kernel does not crash.
> > 
> > I still don't understand.  You haven't provided nearly enough
> > information.  You should start by answering the questions that Oliver
> > asked.  Then answer this question:
> > 
> > The code you are concerned about is this:
> > 
> > 		r = usb_control_msg(udev, usb_rcvaddr0pipe(),
> > 				USB_REQ_GET_DESCRIPTOR, USB_DIR_IN,
> > 				USB_DT_DEVICE << 8, 0,
> > 				buf, GET_DESCRIPTOR_BUFSIZE,
> > 				initial_descriptor_timeout);
> > 		switch (buf->bMaxPacketSize0) {
> > 
> > You're worried that if an ECC memory error occurs during the
> > usb_control_msg transfer, the kernel will crash when the "switch"
> > statement tries to read the value of buf->bMaxPacketSize0.  That's a
> > reasonable thing to worry about.
> 
> Albeit unlikely. If the hardware and implementation are reasonable
> you'd return a specific error code from the HCD and clean up the
> RAM in your ecc driver.
> 
> The fix for USB would then conceptually be something like
> 
> retryio:
> 	r = usb_control_msg()
> 	if (r == -EMEMORYCORRUPTION)
> 		goto retryio;

Yes, we could do this, but it's not necessary.  Let's say that the HCD 
returns -EMEMORYCORRUPTION and the ecc driver cleans up the RAM 
(probably by resetting its contents to 0, but possibly leaving garbage 
there instead).  Then when the following code in hub_port_init() tests 
buf->bMaxPacketSize0, it will see an invalid value and will retry the 
transfer.

Or, with low probability, it will see a valid but incorrect value.  If 
that happens then later transfers using ep0 will fail, causing the hub 
driver to reiterate the outer loop in hub_port_connect().  Eventually 
the device will be correctly initialized and enumerated.

Alan Stern

liulongfang Aug. 10, 2023, 1:20 a.m. UTC | #17

On 2023/7/27 23:57, Alan Stern wrote:
> On Thu, Jul 27, 2023 at 05:31:41PM +0200, Oliver Neukum wrote:
>> On 27.07.23 16:42, Alan Stern wrote:
>>> On Thu, Jul 27, 2023 at 03:03:57PM +0800, liulongfang wrote:
>>>> On 2023/7/26 22:20, Alan Stern wrote:
>>
>>>>> It seems to me that something along these lines must be necessary in
>>>>> any case.  Unless the bad memory is cleared somehow, it would never be
>>>>> usable again.  The kernel might deallocate it, then reallocate for
>>>>> another purpose, and then crash when the new user tries to access it.
>>>>>
>>>>> In fact, this scenario could still happen even with your patch, which
>>>>> means the patch doesn't really fix the problem.
>>
>> I suppose in theory you could have something like a bad blocks list
>> just for RAM, but that would really hurt. You'd have to do something
>> about every DMA operation in every driver in theory.
>>
>> Error handling would basically be an intentional memory leak.
> 
> I started out thinking this way, but maybe that's not how it works.  
> Perhaps simply overwriting the part of memory that got the ECC error 
> would clear the error state.  (This may depend on the kind of error, 
> one-time vs. permanent.)
> 
> If that's the case, and if the memory buffer was deallocated without 
> being accessed and then later reallocated, things would be okay.  The 
> routine that reallocated the buffer wouldn't try to read from it before 
> initializing it somehow.
> 
>>>> This patch is only used to prevent data in the buffer from being accessed.
>>>> As long as the data is not accessed, the kernel does not crash.
>>>
>>> I still don't understand.  You haven't provided nearly enough
>>> information.  You should start by answering the questions that Oliver
>>> asked.  Then answer this question:
>>>
>>> The code you are concerned about is this:
>>>
>>> 		r = usb_control_msg(udev, usb_rcvaddr0pipe(),
>>> 				USB_REQ_GET_DESCRIPTOR, USB_DIR_IN,
>>> 				USB_DT_DEVICE << 8, 0,
>>> 				buf, GET_DESCRIPTOR_BUFSIZE,
>>> 				initial_descriptor_timeout);
>>> 		switch (buf->bMaxPacketSize0) {
>>>
>>> You're worried that if an ECC memory error occurs during the
>>> usb_control_msg transfer, the kernel will crash when the "switch"
>>> statement tries to read the value of buf->bMaxPacketSize0.  That's a
>>> reasonable thing to worry about.
>>
>> Albeit unlikely. If the hardware and implementation are reasonable
>> you'd return a specific error code from the HCD and clean up the
>> RAM in your ecc driver.
>>
>> The fix for USB would then conceptually be something like
>>
>> retryio:
>> 	r = usb_control_msg()
>> 	if (r == -EMEMORYCORRUPTION)
>> 		goto retryio;
> 
> Yes, we could do this, but it's not necessary.  Let's say that the HCD 
> returns -EMEMORYCORRUPTION and the ecc driver cleans up the RAM 
> (probably by resetting its contents to 0, but possibly leaving garbage 
> there instead).  Then when the following code in hub_port_init() tests 
> buf->bMaxPacketSize0, it will see an invalid value and will retry the 
> transfer.
> 
> Or, with low probability, it will see a valid but incorrect value.  If 
> that happens then later transfers using ep0 will fail, causing the hub 
> driver to reiterate the outer loop in hub_port_connect().  Eventually 
> the device will be correctly initialized and enumerated.
> 
> Alan Stern
>

OK, thanks.
Longfang.
> .
>

USB:bugfix a controller halt error

Commit Message

Comments

Patch