
hwrng: core - Add WARN_ON for buggy read return values

Message ID ZvEFQAWVgWNd9j7e@gondor.apana.org.au (mailing list archive)
State Accepted
Delegated to: Herbert Xu
Series hwrng: core - Add WARN_ON for buggy read return values

Commit Message

Herbert Xu Sept. 23, 2024, 6:05 a.m. UTC
Dear TPM maintainers:

Please have a look at the tpm hwrng driver because it appears to
be returning a length longer than the buffer length that we gave
it.  In particular, tpm2 appears to be the culprit (though I didn't
really check tpm1 at all so it could also be buggy).

The following patch hopefully should confirm that this is indeed
caused by TPM and not some other HWRNG driver.

---8<---
If a buggy driver returns a length that is longer than the size
of the buffer provided to it, then this may lead to a buffer overread
in the caller.

Stop this by adding a check for it in the hwrng core.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Jarkko Sakkinen Sept. 23, 2024, 7:52 a.m. UTC | #1
On Mon Sep 23, 2024 at 9:05 AM EEST, Herbert Xu wrote:
> Dear TPM maintainers:

There's really only me (for the past 10 years). Maybe that should be
updated.

>
> Please have a look at the tpm hwrng driver because it appears to
> be returning a length longer than the buffer length that we gave
> it.  In particular, tpm2 appears to be the culprit (though I didn't
> really check tpm1 at all so it could also be buggy).
>
> The following patch hopefully should confirm that this is indeed
> caused by TPM and not some other HWRNG driver.




>
> ---8<---
> If a buggy driver returns a length that is longer than the size
> of the buffer provided to it, then this may lead to a buffer overread
> in the caller.
>
> Stop this by adding a check for it in the hwrng core.
>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> index 57c51efa5613..018316f54621 100644
> --- a/drivers/char/hw_random/core.c
> +++ b/drivers/char/hw_random/core.c
> @@ -181,8 +181,15 @@ static inline int rng_get_data(struct hwrng *rng, u8 *buffer, size_t size,
>  	int present;
>  
>  	BUG_ON(!mutex_is_locked(&reading_mutex));
> -	if (rng->read)
> -		return rng->read(rng, (void *)buffer, size, wait);
> +	if (rng->read) {
> +		int err;
> +
> +		err = rng->read(rng, buffer, size, wait);
> +		if (WARN_ON_ONCE(err > 0 && err > size))

Are you sure you want to use WARN_ON_ONCE here instead of
pr_warn_once()? I.e. is it worth taking down the whole
kernel?

> +			err = size;
> +
> +		return err;
> +	}
>  
>  	if (rng->data_present)
>  		present = rng->data_present(rng, wait);

BR, Jarkko
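
For reference, a pr_warn_once()-based variant of the hunk under discussion might look like the sketch below. This is not part of the posted patch; the message text and the use of rng->name are illustrative only, and it keeps the same clamp while trading the WARN backtrace for an explicit dump_stack():

	if (rng->read) {
		int err;

		err = rng->read(rng, buffer, size, wait);
		/*
		 * err is a signed int and size is a size_t, so the err > 0
		 * guard stops a negative error code from being promoted to a
		 * huge unsigned value in the err > size comparison.
		 */
		if (err > 0 && err > size) {
			pr_warn_once("%s: read returned %d bytes for a %zu byte buffer\n",
				     rng->name, err, size);
			dump_stack();
			err = size;
		}

		return err;
	}
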
Jarkko Sakkinen Sept. 23, 2024, 8:07 a.m. UTC | #2
On Mon Sep 23, 2024 at 10:52 AM EEST, Jarkko Sakkinen wrote:
> On Mon Sep 23, 2024 at 9:05 AM EEST, Herbert Xu wrote:
> > Dear TPM maintainers:
>
> There's really only me (for the past 10 years). Maybe that should be
> updated.
>
> >
> > Please have a look at the tpm hwrng driver because it appears to
> > be returning a length longer than the buffer length that we gave
> > it.  In particular, tpm2 appears to be the culprit (though I didn't
> > really check tpm1 at all so it could also be buggy).
> >
> > The following patch hopefully should confirm that this is indeed
> > caused by TPM and not some other HWRNG driver.
>
>
>
>
> >
> > ---8<---
> > If a buggy driver returns a length that is longer than the size
> > of the buffer provided to it, then this may lead to a buffer overread
> > in the caller.
> >
> > Stop this by adding a check for it in the hwrng core.
> >
> > Reported-by: Guangwu Zhang <guazhang@redhat.com>
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> >
> > diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> > index 57c51efa5613..018316f54621 100644
> > --- a/drivers/char/hw_random/core.c
> > +++ b/drivers/char/hw_random/core.c
> > @@ -181,8 +181,15 @@ static inline int rng_get_data(struct hwrng *rng, u8 *buffer, size_t size,
> >  	int present;
> >  
> >  	BUG_ON(!mutex_is_locked(&reading_mutex));
> > -	if (rng->read)
> > -		return rng->read(rng, (void *)buffer, size, wait);
> > +	if (rng->read) {
> > +		int err;
> > +
> > +		err = rng->read(rng, buffer, size, wait);
> > +		if (WARN_ON_ONCE(err > 0 && err > size))
>
> Are you sure you want to use WARN_ON_ONCE here instead of
> pr_warn_once()? I.e. is it worth taking down the whole
> kernel?

I looked at tpm2_get_random() and it is pretty inefficient (not the same
as saying it has a bug). I'd love to reimplement it.

We would be better off if it pulled random data in, say, 256-byte
or 512-byte chunks and cached that internally in tpm_chip. Then
requests would be served from that pre-fetched pool instead of doing a
round-trip to the TPM every single time.

This would improve overall kernel performance given the reduced
number of wait states. I wonder if anyone knows what would be a
good fetch size that will work on most TPM2 chips?

BR, Jarkko
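
A very rough sketch of the pre-fetched pool idea described above could look like the following. The rng_pool/rng_pool_len fields and the TPM_RNG_POOL_SIZE constant are hypothetical additions to struct tpm_chip (they do not exist today), and locking against concurrent callers is deliberately glossed over:

	#define TPM_RNG_POOL_SIZE 256	/* hypothetical per-chip cache size */

	static int tpm2_get_random_cached(struct tpm_chip *chip, u8 *dest, size_t max)
	{
		size_t n;
		int rc;

		/* Refill the cache with one TPM round-trip when it runs dry. */
		if (chip->rng_pool_len == 0) {
			rc = tpm2_get_random(chip, chip->rng_pool, TPM_RNG_POOL_SIZE);
			if (rc < 0)
				return rc;
			chip->rng_pool_len = rc;
		}

		/* Serve the request from the tail of the cached bytes. */
		n = min_t(size_t, max, chip->rng_pool_len);
		chip->rng_pool_len -= n;
		memcpy(dest, chip->rng_pool + chip->rng_pool_len, n);

		return n;
	}
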
Jarkko Sakkinen Sept. 23, 2024, 8:09 a.m. UTC | #3
On Mon Sep 23, 2024 at 11:07 AM EEST, Jarkko Sakkinen wrote:
> On Mon Sep 23, 2024 at 10:52 AM EEST, Jarkko Sakkinen wrote:
> > On Mon Sep 23, 2024 at 9:05 AM EEST, Herbert Xu wrote:
> > > Dear TPM maintainers:
> >
> > There's really only me (for the past 10 years). Maybe that should be
> > updated.
> >
> > >
> > > Please have a look at the tpm hwrng driver because it appears to
> > > be returning a length longer than the buffer length that we gave
> > > it.  In particular, tpm2 appears to be the culprit (though I didn't
> > > really check tpm1 at all so it could also be buggy).
> > >
> > > The following patch hopefully should confirm that this is indeed
> > > caused by TPM and not some other HWRNG driver.
> >
> >
> >
> >
> > >
> > > ---8<---
> > > If a buggy driver returns a length that is longer than the size
> > > of the buffer provided to it, then this may lead to a buffer overread
> > > in the caller.
> > >
> > > Stop this by adding a check for it in the hwrng core.
> > >
> > > Reported-by: Guangwu Zhang <guazhang@redhat.com>
> > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > >
> > > diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> > > index 57c51efa5613..018316f54621 100644
> > > --- a/drivers/char/hw_random/core.c
> > > +++ b/drivers/char/hw_random/core.c
> > > @@ -181,8 +181,15 @@ static inline int rng_get_data(struct hwrng *rng, u8 *buffer, size_t size,
> > >  	int present;
> > >  
> > >  	BUG_ON(!mutex_is_locked(&reading_mutex));
> > > -	if (rng->read)
> > > -		return rng->read(rng, (void *)buffer, size, wait);
> > > +	if (rng->read) {
> > > +		int err;
> > > +
> > > +		err = rng->read(rng, buffer, size, wait);
> > > +		if (WARN_ON_ONCE(err > 0 && err > size))
> >
> > Are you sure you want to use WARN_ON_ONCE here instead of
> > pr_warn_once()? I.e. is it worth taking down the whole
> > kernel?
>
> I looked at tpm2_get_random() and it is pretty inefficient (not the same
> as saying it has a bug). I'd love to reimplement it.
>
> We would be better off if it pulled random data in, say, 256-byte
> or 512-byte chunks and cached that internally in tpm_chip. Then
> requests would be served from that pre-fetched pool instead of doing a
> round-trip to the TPM every single time.
>
> This would improve overall kernel performance given the reduced
> number of wait states. I wonder if anyone knows what would be a
> good fetch size that will work on most TPM2 chips?

Herbert, I'm going to the gym, but I can help you debug the
issue you're seeing with a patch for tpm2_get_random(). We need
to fix that one first, of course (you should never build on top of
broken code). Once I'm back from the gym and grocery shopping I'll bake
a debugging patch.

BR, Jarkko
Herbert Xu Sept. 23, 2024, 9:26 a.m. UTC | #4
On Mon, Sep 23, 2024 at 10:52:49AM +0300, Jarkko Sakkinen wrote:
> On Mon Sep 23, 2024 at 9:05 AM EEST, Herbert Xu wrote:
> > Dear TPM maintainers:
> 
> There's really only me (for the past 10 years). Maybe that should be
> updated.

:)

> >  	BUG_ON(!mutex_is_locked(&reading_mutex));
> > -	if (rng->read)
> > -		return rng->read(rng, (void *)buffer, size, wait);
> > +	if (rng->read) {
> > +		int err;
> > +
> > +		err = rng->read(rng, buffer, size, wait);
> > +		if (WARN_ON_ONCE(err > 0 && err > size))
> 
> Are you sure you want to use WARN_ON_ONCE here instead of
> pr_warn_once()? I.e. is it worth taking down the whole
> kernel?

Absolutely.  If this triggers it's a serious kernel bug and we
should gather as much information as possible.  pr_warn_once is
not the same thing as WARN_ON_ONCE in terms of what it prints.

If people want to turn WARNs into BUGs, then they've only got
themselves to blame when the kernel goes down.  On the other
hand perhaps they *do* want this to panic and we should hand
it to them.

Thanks,
Jarkko Sakkinen Sept. 23, 2024, 2:31 p.m. UTC | #5
On Mon Sep 23, 2024 at 12:26 PM EEST, Herbert Xu wrote:
 > > +
> > > +		err = rng->read(rng, buffer, size, wait);
> > > +		if (WARN_ON_ONCE(err > 0 && err > size))
> > 
> > Are you sure you want to use WARN_ON_ONCE here instead of
> > pr_warn_once()? I.e. is it worth taking down the whole
> > kernel?
>
> Absolutely.  If this triggers it's a serious kernel bug and we
> should gather as much information as possible.  pr_warn_once is
> not the same thing as WARN_ON_ONCE in terms of what it prints.

Personally, I allow the use of WARN only as a last resort.

If you need a stack printout you can always use dump_stack().

>
> If people want to turn WARNs into BUGs, then they've only got
> themselves to blame when the kernel goes down.  On the other
> hand perhaps they *do* want this to panic and we should hand
> it to them.

Actually when you turn on "panic_on_warn" the user expectation is and
should be that the sites where WARN is used have been hand picked with
consideration so that panic happens for a reason.

This has also been noted repeatedly by Greg:

https://lore.kernel.org/linux-cve-announce/2024061828-CVE-2024-36975-6719@gregkh/

I should check this somewhere, but these days a wrongly chosen
WARN() might lead to a CVE entry. That fix was by me, but I never created
the CVE.

Greg, did we have something under Documentation/ that would fully
address the use of WARN?

BR, Jarkko
Jarkko Sakkinen Sept. 23, 2024, 2:36 p.m. UTC | #6
On Mon Sep 23, 2024 at 5:31 PM EEST, Jarkko Sakkinen wrote:
> Greg, did we have something under Documentation/ that would fully
> address the use of WARN?

... personally I think that even if there were a separate document, this
should also be addressed in SubmittingPatches so that it
can't possibly be missed by anyone.

BR, Jarkko
Greg KH Sept. 23, 2024, 2:48 p.m. UTC | #7
On Mon, Sep 23, 2024 at 05:31:47PM +0300, Jarkko Sakkinen wrote:
> On Mon Sep 23, 2024 at 12:26 PM EEST, Herbert Xu wrote:
>  > > +
> > > > +		err = rng->read(rng, buffer, size, wait);
> > > > +		if (WARN_ON_ONCE(err > 0 && err > size))
> > > 
> > > Are you sure you want to use WARN_ON_ONCE here instead of
> > > pr_warn_once()? I.e. is it worth taking down the whole
> > > kernel?
> >
> > Absolutely.  If this triggers it's a serious kernel bug and we
> > should gather as much information as possible.  pr_warn_once is
> > not the same thing as WARN_ON_ONCE in terms of what it prints.
> 
> Personally, I allow the use of WARN only as a last resort.
> 
> If you need a stack printout you can always use dump_stack().
> 
> >
> > If people want to turn WARNs into BUGs, then they've only got
> > themselves to blame when the kernel goes down.  On the other
> > hand perhaps they *do* want this to panic and we should hand
> > it to them.
> 
> Actually when you turn on "panic_on_warn" the user expectation is and
> should be that the sites where WARN is used have been hand picked with
> consideration so that panic happens for a reason.
> 
> This has also been noted repeatedly by Greg:
> 
> https://lore.kernel.org/linux-cve-announce/2024061828-CVE-2024-36975-6719@gregkh/
> 
> I should check this somewhere, but these days a wrongly chosen
> WARN() might lead to a CVE entry. That fix was by me, but I never created
> the CVE.
> 
> Greg, did we have something under Documentation/ that would fully
> address the use of WARN?

Please see:
	https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
which describes that.  We should make it more explicit that any WARN()
or WARN_ON() calls that can be hit by user interactions somehow, will
end up getting a CVE id when we fix it up to not do so.

thanks,

greg k-h
Jarkko Sakkinen Sept. 23, 2024, 8:46 p.m. UTC | #8
On Mon Sep 23, 2024 at 5:48 PM EEST, Greg KH wrote:
> On Mon, Sep 23, 2024 at 05:31:47PM +0300, Jarkko Sakkinen wrote:
> > On Mon Sep 23, 2024 at 12:26 PM EEST, Herbert Xu wrote:
> >  > > +
> > > > > +		err = rng->read(rng, buffer, size, wait);
> > > > > +		if (WARN_ON_ONCE(err > 0 && err > size))
> > > > 
> > > > Are you sure you want to use WARN_ON_ONCE here instead of
> > > > pr_warn_once()? I.e. is it worth taking down the whole
> > > > kernel?
> > >
> > > Absolutely.  If this triggers it's a serious kernel bug and we
> > > should gather as much information as possible.  pr_warn_once is
> > > not the same thing as WARN_ON_ONCE in terms of what it prints.
> > 
> > Personally, I allow the use of WARN only as a last resort.
> > 
> > If you need a stack printout you can always use dump_stack().
> > 
> > >
> > > If people want to turn WARNs into BUGs, then they've only got
> > > themselves to blame when the kernel goes down.  On the other
> > > hand perhaps they *do* want this to panic and we should hand
> > > it to them.
> > 
> > Actually when you turn on "panic_on_warn" the user expectation is and
> > should be that the sites where WARN is used have been hand picked with
> > consideration so that panic happens for a reason.
> > 
> > This has also been noted repeatedly by Greg:
> > 
> > https://lore.kernel.org/linux-cve-announce/2024061828-CVE-2024-36975-6719@gregkh/
> > 
> > I should check this somewhere, but these days a wrongly chosen
> > WARN() might lead to a CVE entry. That fix was by me, but I never created
> > the CVE.
> > 
> > Greg, did we have something under Documentation/ that would fully
> > address the use of WARN?
>
> Please see:
> 	https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
> which describes that.  We should make it more explicit that any WARN()
> or WARN_ON() calls that can be hit by user interactions somehow, will
> end up getting a CVE id when we fix it up to not do so.

I bookmarked this thanks :-)

Herbert, I'll do comprehensive testing tomorrow by adding some invariants to
tpm2_get_random(). I'd really love to reimplement it because the current
implementation frankly sucks (and it's by me), but yep, we need to fix it
first and foremost.

>
> thanks,
>
> greg k-h


BR, Jarkko
Herbert Xu Sept. 23, 2024, 10:32 p.m. UTC | #9
On Mon, Sep 23, 2024 at 04:48:27PM +0200, Greg KH wrote:
>
> Please see:
> 	https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
> which describes that.  We should make it more explicit that any WARN()
> or WARN_ON() calls that can be hit by user interactions somehow, will
> end up getting a CVE id when we fix it up to not do so.

If the aforementioned WARN_ON hits, then the driver has probably
already done a buffer overrun, so it's a CVE anyway.

Cheers,
Jarkko Sakkinen Sept. 24, 2024, 4:05 p.m. UTC | #10
On Tue Sep 24, 2024 at 1:32 AM EEST, Herbert Xu wrote:
> On Mon, Sep 23, 2024 at 04:48:27PM +0200, Greg KH wrote:
> >
> > Please see:
> > 	https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
> > which describes that.  We should make it more explicit that any WARN()
> > or WARN_ON() calls that can be hit by user interactions somehow, will
> > end up getting a CVE id when we fix it up to not do so.
>
> If the aforementioned WARN_ON hits, then the driver has probably
> already done a buffer overrun, so it's a CVE anyway.

We'll see; I finally got around to testing this. Sorry for the latency: I'm
switching jobs, and unfortunately the German post office lost my priority
mail containing the contracts (sent from Finland to Berlin), so I have
been signing, scanning etc. the whole day :-) It's my last week at the
current job, and next week is the first at the new one, so this
week is a bit bumpy.

BR, Jarkko
Jarkko Sakkinen Sept. 24, 2024, 5:43 p.m. UTC | #11
On Tue Sep 24, 2024 at 7:05 PM EEST, Jarkko Sakkinen wrote:
> On Tue Sep 24, 2024 at 1:32 AM EEST, Herbert Xu wrote:
> > On Mon, Sep 23, 2024 at 04:48:27PM +0200, Greg KH wrote:
> > >
> > > Please see:
> > > 	https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
> > > which describes that.  We should make it more explicit that any WARN()
> > > or WARN_ON() calls that can be hit by user interactions somehow, will
> > > end up getting a CVE id when we fix it up to not do so.
> >
> > If the aforementioned WARN_ON hits, then the driver has probably
> > already done a buffer overrun, so it's a CVE anyway.
>
> We'll see; I finally got around to testing this. Sorry for the latency: I'm
> switching jobs, and unfortunately the German post office lost my priority
> mail containing the contracts (sent from Finland to Berlin), so I have
> been signing, scanning etc. the whole day :-) It's my last week at the
> current job, and next week is the first at the new one, so this
> week is a bit bumpy.

I get nothing with this:

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index aba024cbe7c5..856a8356d971 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -341,12 +341,15 @@ int tpm2_get_random(struct tpm_chip *chip, u8 *dest, size_t max)

                dest_ptr += recd;
                total += recd;
+
+               WARN_ON(num_bytes < recd);
                num_bytes -= recd;
        } while (retries-- && total < max);

        tpm_buf_destroy(&buf);
        tpm2_end_auth_session(chip);

+       WARN_ON(total > max);
        return total ? total : -EIO;
 out:
        tpm_buf_destroy(&buf);

[The WARN_ON()s here are only for this temporary diff.]

Call stack:

1. tpm2_get_random():
   https://elixir.bootlin.com/linux/v6.11-rc7/source/drivers/char/tpm/tpm2-cmd.c#L281
2. tpm_get_random():
   https://elixir.bootlin.com/linux/v6.11-rc7/source/drivers/char/tpm/tpm-interface.c#L430
3. tpm_hwrng_read():
   https://elixir.bootlin.com/linux/v6.11-rc7/source/drivers/char/tpm/tpm-chip.c#L524

Everything also seems to have appropriate range checks.

Without any traces that would provide more information I don't see
the smoking gun.

BR, Jarkko
Herbert Xu Sept. 27, 2024, 12:42 a.m. UTC | #12
On Tue, Sep 24, 2024 at 08:43:00PM +0300, Jarkko Sakkinen wrote:
>
> Without any traces that would provide more information I don't see
> the smoking gun.

I haven't confirmed that it's definitely the tpm2 driver; that's just
based on the backtrace.  Hopefully my patch will confirm it one way
or the other.  Here is the backtrace:

[  100.784159] vmd 0000:c2:00.5: Bound to PCI domain 10002 
[  100.786209] Monitor-Mwait will be used to enter C-1 state 
[  100.786225] Monitor-Mwait will be used to enter C-2 state 
[  100.786244] ACPI: \_SB_.SCK0.C000: Found 2 idle states 
[  100.823093] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0 
[  100.823636] ACPI: button: Power Button [PWRF] 
[  100.905756] ERST: Error Record Serialization Table (ERST) support is initialized. 
[  100.905858] pstore: Using crash dump compression: deflate 
[  100.905861] pstore: Registered erst as persistent store backend 
[  100.907044] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled 
[  100.908305] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A 
[  100.926608] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A 
[  100.942953] Non-volatile memory driver v1.3 
[  100.947908] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id 22) 
[  101.226913] ACPI: bus type drm_connector registered 
[  101.229708] alg: ecdh-nist-p256 (ecdh-nist-p256-generic) is disabled due to FIPS 
[  101.229745] tpm tpm0: crypto ecdh allocation failed 
[  101.236311] tpm tpm0: A TPM error (708) occurred start auth session 
[  101.238797] ================================================================== 
[  101.238800] BUG: KASAN: slab-out-of-bounds in blake2s_update+0x135/0x2b0 
[  101.238808] Read of size 44 at addr ff11000167334d98 by task hwrng/318 
[  101.238811]  
[  101.238813] CPU: 26 UID: 0 PID: 318 Comm: hwrng Not tainted 6.11.0-0.rc5.22.el10.x86_64+debug #1 
[  101.238818] Hardware name: Supermicro SSG-110P-NTR10-EI018/X12SPO-NTF, BIOS 1.3 05/20/2022 
[  101.238820] Call Trace: 
[  101.238823]  <TASK> 
[  101.238826]  dump_stack_lvl+0x6f/0xb0 
[  101.238833]  ? blake2s_update+0x135/0x2b0 
[  101.238836]  print_address_description.constprop.0+0x88/0x330 
[  101.238843]  ? blake2s_update+0x135/0x2b0 
[  101.238847]  print_report+0x108/0x209 
[  101.238851]  ? blake2s_update+0x135/0x2b0 
[  101.238855]  ? __virt_addr_valid+0x20b/0x440 
[  101.238859]  ? blake2s_update+0x135/0x2b0 
[  101.238863]  kasan_report+0xa8/0xe0 
[  101.238868]  ? blake2s_update+0x135/0x2b0 
[  101.238874]  kasan_check_range+0x10f/0x1f0 
[  101.238879]  __asan_memcpy+0x23/0x60 
[  101.238883]  blake2s_update+0x135/0x2b0 
[  101.238887]  add_hwgenerator_randomness+0x3d/0xe0 
[  101.238895]  hwrng_fillfn+0x144/0x270 
[  101.238900]  ? __pfx_hwrng_fillfn+0x10/0x10 
[  101.238904]  kthread+0x2d2/0x3a0 
[  101.238908]  ? __pfx_kthread+0x10/0x10 
[  101.238912]  ret_from_fork+0x31/0x70 
[  101.238917]  ? __pfx_kthread+0x10/0x10 
[  101.238920]  ret_from_fork_asm+0x1a/0x30 
[  101.238929]  </TASK> 
[  101.238931]  
[  101.238932] Allocated by task 1: 
[  101.238934]  kasan_save_stack+0x30/0x50 
[  101.238937]  kasan_save_track+0x14/0x30 
[  101.238940]  __kasan_kmalloc+0x8f/0xa0 
[  101.238942]  __kmalloc_noprof+0x1fe/0x410 
[  101.238947]  kobj_map+0x7e/0x6d0 
[  101.238951]  cdev_add+0x92/0x180 
[  101.238954]  tty_cdev_add+0x17a/0x280 
[  101.238957]  tty_register_device_attr+0x401/0x740 
[  101.238960]  tty_register_driver+0x381/0x6f0 
[  101.238963]  vty_init+0x2c1/0x2f0 
[  101.238967]  tty_init+0x13b/0x150 
[  101.238970]  do_one_initcall+0x11c/0x5c0 
[  101.238975]  do_initcalls+0x1b4/0x1f0 
[  101.238980]  kernel_init_freeable+0x4ae/0x520 
[  101.238984]  kernel_init+0x1c/0x150 
[  101.238988]  ret_from_fork+0x31/0x70 
[  101.238991]  ret_from_fork_asm+0x1a/0x30 
[  101.238994]  
[  101.238995] The buggy address belongs to the object at ff11000167334d80 
[  101.238995]  which belongs to the cache kmalloc-64 of size 64 
[  101.238998] The buggy address is located 24 bytes inside of 
[  101.238998]  allocated 56-byte region [ff11000167334d80, ff11000167334db8) 
[  101.239002]  
[  101.239003] The buggy address belongs to the physical page: 
[  101.239004] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x167334 
[  101.239008] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) 
[  101.239012] page_type: 0xfdffffff(slab) 
[  101.239016] raw: 0017ffffc0000000 ff1100010003c8c0 dead000000000122 0000000000000000 
[  101.239019] raw: 0000000000000000 0000000000200020 00000001fdffffff 0000000000000000 
[  101.239021] page dumped because: kasan: bad access detected 
[  101.239023]  
[  101.239024] Memory state around the buggy address: 
[  101.239025]  ff11000167334c80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc 
[  101.239028]  ff11000167334d00: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc 
[  101.239030] >ff11000167334d80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc 
[  101.239031]                                         ^ 
[  101.239033]  ff11000167334e00: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc 
[  101.239035]  ff11000167334e80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc 
[  101.239037] ================================================================== 
[  101.383067] rdac: device handler registered 
[  101.383412] hp_sw: device handler registered 
[  101.383415] emc: device handler registered 
[  101.383879] alua: device handler registered 
[  101.391255] xhci_hcd 0000:00:14.0: xHCI Host Controller 
[  101.391892] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 1 
[  101.393706] xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x0000000000009810 
[  101.399646] xhci_hcd 0000:00:14.0: xHCI Host Controller 
[  101.400136] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 2 
[  101.400163] xhci_hcd 0000:00:14.0: Host supports USB 3.0 SuperSpeed 
[  101.400818] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.11 
[  101.400823] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1 
[  101.400826] usb usb1: Product: xHCI Host Controller 
[  101.400829] usb usb1: Manufacturer: Linux 6.11.0-0.rc5.22.el10.x86_64+debug xhci-hcd 
[  101.400832] usb usb1: SerialNumber: 0000:00:14.0 
[  101.403055] hub 1-0:1.0: USB hub found 
[  101.403222] hub 1-0:1.0: 16 ports detected 
[  101.657974] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.11 
[  101.657982] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 
[  101.657986] usb usb2: Product: xHCI Host Controller 
[  101.657990] usb usb2: Manufacturer: Linux 6.11.0-0.rc5.22.el10.x86_64+debug xhci-hcd 
[  101.657993] usb usb2: SerialNumber: 0000:00:14.0 
[  101.660659] hub 2-0:1.0: USB hub found 
[  101.660882] hub 2-0:1.0: 10 ports detected 

Thanks,
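
To tie the report to the patch: the length returned by the driver's ->read() ends up being passed by the hwrng kthread to add_hwgenerator_randomness(), which feeds that many bytes from the shared rng buffer into blake2s_update(). The loop below is a simplified sketch (not the verbatim hwrng_fillfn(); locking, error handling and entropy accounting are omitted) showing why a too-large return value produces exactly the slab-out-of-bounds read that KASAN flags above:

	for (;;) {
		/* rng_fillbuf is rng_buffer_size() bytes long */
		int rc = rng_get_data(rng, rng_fillbuf, rng_buffer_size(), 1);

		if (rc <= 0)
			continue;

		/*
		 * If a buggy ->read() reported rc > rng_buffer_size(), this
		 * call reads past the end of rng_fillbuf -- hence the clamp
		 * added by the patch.  Entropy estimate simplified here.
		 */
		add_hwgenerator_randomness(rng_fillbuf, rc, rc * 8, true);
	}
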
Jarkko Sakkinen Oct. 7, 2024, 11:28 p.m. UTC | #13
On Fri, 2024-09-27 at 08:42 +0800, Herbert Xu wrote:
> On Tue, Sep 24, 2024 at 08:43:00PM +0300, Jarkko Sakkinen wrote:
> > 
> > Without any traces that would provide more information I don't see
> > the smoking gun.
> 
> I haven't confirmed that it's definitely the tpm2 driver; that's just
> based on the backtrace.  Hopefully my patch will confirm it one way
> or the other.  Here is the backtrace:

Agreed.

> 
> [  100.784159] vmd 0000:c2:00.5: Bound to PCI domain 10002 
> [  100.786209] Monitor-Mwait will be used to enter C-1 state 
> [  100.786225] Monitor-Mwait will be used to enter C-2 state 
> [  100.786244] ACPI: \_SB_.SCK0.C000: Found 2 idle states 
> [  100.823093] input: Power Button as
> /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0 
> [  100.823636] ACPI: button: Power Button [PWRF] 
> [  100.905756] ERST: Error Record Serialization Table (ERST) support
> is initialized. 
> [  100.905858] pstore: Using crash dump compression: deflate 
> [  100.905861] pstore: Registered erst as persistent store backend 
> [  100.907044] Serial: 8250/16550 driver, 4 ports, IRQ sharing
> enabled 
> [  100.908305] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud =
> 115200) is a 16550A 
> [  100.926608] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud =
> 115200) is a 16550A 
> [  100.942953] Non-volatile memory driver v1.3 
> [  100.947908] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id
> 22) 
> [  101.226913] ACPI: bus type drm_connector registered 
> [  101.229708] alg: ecdh-nist-p256 (ecdh-nist-p256-generic) is
> disabled due to FIPS 
> [  101.229745] tpm tpm0: crypto ecdh allocation failed 
> [  101.236311] tpm tpm0: A TPM error (708) occurred start auth

I guess it is TPM2_StartAuthSession returning TPM_RC_VALUE. Probably
James should look into this as the bus encryption code is clearly
tripping here.

I'm in my second week at a new job, so I cannot promise any bandwidth
yet this week. Earliest next week...

BR, Jarkko

Patch

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index 57c51efa5613..018316f54621 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -181,8 +181,15 @@  static inline int rng_get_data(struct hwrng *rng, u8 *buffer, size_t size,
 	int present;
 
 	BUG_ON(!mutex_is_locked(&reading_mutex));
-	if (rng->read)
-		return rng->read(rng, (void *)buffer, size, wait);
+	if (rng->read) {
+		int err;
+
+		err = rng->read(rng, buffer, size, wait);
+		if (WARN_ON_ONCE(err > 0 && err > size))
+			err = size;
+
+		return err;
+	}
 
 	if (rng->data_present)
 		present = rng->data_present(rng, wait);